Abstract

Modern classification problems frequently present mild to severe label imbalance as well as specific requirements
on classification characteristics, and require optimizing performance measures that are non-decomposable over the 
dataset, such as F-measure. Such measures have spurred much interest and pose specific challenges to learning 
algorithms since their non-additive nature precludes a direct application of well-studied large scale optimization 
methods such as stochastic gradient descent. 

In this paper we reveal that for two large families of performance measures that can be expressed as functions of 
true positive/negative rates, it is indeed possible to implement point stochastic updates. The families we consider are 
concave and pseudo-linear functions of TPR, TNR which cover several popularly used performance measures such 
as F-measure, G-mean and H-mean. 

Our core contribution is an adaptive linearization scheme for these families, using which we develop optimization
techniques that enable truly point-based stochastic updates. For concave performance measures we propose SPADE,
a stochastic primal dual solver; for pseudo-linear measures we propose STAMP, a stochastic alternate maximization
procedure. Both methods have crisp convergence guarantees, demonstrate significant speedups over existing methods,
often by an order of magnitude or more, and give similar or more accurate predictions on test data.


1 Introduction 


Learning applications with binary classification problems involving severe label imbalance abound, often accompanied
by specific requirements in terms of false positive or false negative rates. Examples include spam classification, anomaly
detection, and medical applications. Class imbalance is also often introduced as a result of the reduction of a problem
to binary classification, such as in multi-class problems [Bishop, 2006] and multi-label problems due to extreme label
sparsity [Hsu et al., 2009].

Traditional performance measures such as misclassification rate are ill-suited in such situations as it is usually 
trivial to optimize them by constantly predicting the majority class. Instead, the performance measures of choice 
in such cases are those that perform a more holistic evaluation over the entire data. Naturally, these performance 
measures are non-decomposable over the dataset and cannot be expressed as a sum of errors on individual
data points. Popular examples include F-measure, G-mean, H-mean etc. 
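As an illustration of why misclassification rate is ill-suited here, the following Python sketch evaluates a classifier that always predicts the majority class on a toy sample with 1% positives. The toy labels and all helper names are hypothetical and chosen only for this example; the measures are the standard accuracy, G-mean, H-mean and F1 definitions.

```python
import math

def rates(y_true, y_pred):
    """Return (TPR, TNR) for labels and predictions in {-1, +1}."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    tn = sum(1 for y, yh in zip(y_true, y_pred) if y == -1 and yh == -1)
    return tp / pos, tn / neg

def accuracy(y_true, y_pred):
    return sum(1 for y, yh in zip(y_true, y_pred) if y == yh) / len(y_true)

def g_mean(tpr, tnr):
    return math.sqrt(tpr * tnr)

def h_mean(tpr, tnr):
    return 0.0 if tpr + tnr == 0 else 2 * tpr * tnr / (tpr + tnr)

def f1_score(y_true, y_pred):
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, y_pred) if y == -1 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == -1)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

# Hypothetical toy sample: 10 positives, 990 negatives.
y_true = [1] * 10 + [-1] * 990
y_majority = [-1] * 1000            # always predict the majority class

tpr, tnr = rates(y_true, y_majority)
print("accuracy:", accuracy(y_true, y_majority))
print("G-mean:", g_mean(tpr, tnr))
print("H-mean:", h_mean(tpr, tnr))
print("F1:", f1_score(y_true, y_majority))
```

The trivial predictor is almost perfect in terms of accuracy but scores zero on every measure that depends jointly on TPR and TNR, which is the behavior the holistic measures are meant to expose.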

A consistent effort directed at optimizing these performance measures has, over the years, resulted in the
development of two broad approaches: 1) surrogate based approaches (e.g. SVMPerf [Joachims et al., 2009]) that
design convex surrogates for these performance measures, and 2) indirect approaches, which include cost-sensitive
classification-based approaches [Parambath et al., 2014] that solve weighted classification problems, and plug-in
approaches [Koyejo et al., 2014, Narasimhan et al., 2014] that rely on consistent estimates of class probabilities.


* Work done while H.N. was an intern at Microsoft Research India, Bangalore. 
Both these approaches are known to work fairly well on small datasets but do not scale very well to large ones,
especially those large enough to not even fit in memory. SVMPerf-style approaches, which employ cutting plane
methods, do not scale well. On the other hand, plug-in approaches first need to solve a class probability estimation problem
optimally and then tune a threshold. This two-stage approach prevents the method from exploiting better classifiers
to automatically obtain better thresholds. Moreover, for multi-class problems with C classes, jointly estimating C
parameters can take time exponential in C.

For large datasets, streaming methods such as stochastic gradient descent [Shalev-Shwartz et al., 2011] that take
only a few passes over the entire data are preferable. However, traditional SGD techniques cannot handle
non-decomposable losses. Recently, Kar et al. [2014] proposed optimizing SVMPerf-style surrogates using SGD
techniques. Although their method is generic, allowing optimization of performance measures such as F-measure and
partial AUC, it requires maintaining a large buffer to compute online gradient estimates, which can be prohibitive.

Motivated by the state of the art, we develop novel methods for optimizing two broad families of non-decomposable 
performance measures. Our methods incorporate truly point-wise updates, i.e. do not require a buffer, and require 
only a few passes over data. At an intuitive level, at the core of our work are adaptive linearization strategies for 
these performance measures, which make these measures amenable to SGD-style point-wise updates. Moreover, our 
linearizations are able to feed off the improvements made in learning a better classifier, resulting in faster convergence. 

We consider two classes of performance measures:

Concave Performance Measures (see Table 1): These measures can be written as concave functions of the true
positive (TPR) and true negative (TNR) rates and include G-mean, H-mean etc. We exploit the dual structure of these
functions via their Fenchel dual to linearize them in terms of the TPR, TNR variables. Our method then, in parallel,
tunes the dual variables in this linearization and maximizes the weighted TPR-TNR combination. These updates are
done in an online fashion using stochastic mirror descent steps.

Pseudo-linear Performance Measures (see Table 2): These measures can be written as fractional linear
functions of TPR, TNR and include F-measure and the Jaccard coefficient. These functions need not be concave and the
techniques outlined above do not apply. Instead, we exploit the pseudo-linear structure to linearize the function and
develop a technique to alternately optimize the combination weights and the classifier model via stochastic updates.
Although such "alternate-maximization" strategies in general need not converge even to a local optimum, we show that
our strategy converges to an $\epsilon$-approximate global optimum after $\log(1/\epsilon)$ batch updates or $\tilde{O}(1/\epsilon^2)$ stochastic updates.

Finally, we present an empirical validation of our methods. Our experiments reveal that for a range of performance 
measures in both classes, our methods can be significantly faster than either plug-in or SVMPerf-style methods, as
well as give higher or comparable accuracies. 


2 Related Works 


As noted in Section 1, existing methods for optimizing the performance measures that we study can be divided into
surrogate-based approaches and indirect approaches based on cost-sensitive classification or plug-in methods. A third
approach applicable to certain performance measures is the decision-theoretic method that learns a class probability
estimate and computes predictions that maximize the expected value of the performance measure on a test set
[Lewis, Ye et al.].

In addition to these there exist methods dedicated to specific performance measures. For instance, Parambath et al.
[2014] focus on optimizing F-measure by exploiting the pseudo-linearity of the function along with a cross
validation-based strategy. Our STAMP method, on the other hand, uses an alternating maximization strategy that does
not require cross validation, which considerably improves training time (see the experiments in Section 6). It is
important to note that these performance measures have also been studied in multi-label settings where they no
longer remain non-decomposable. For instance, Dembczyński et al. [2013] study plug-in style methods for maximizing
F-measure in multi-label settings, whereas works such as Koyejo et al. [2014], Narasimhan et al. [2014] and Ye et al.
study plug-in approaches for the same problem in the more challenging binary classification setting.

Historically, online learning algorithms have played a key role in designing solvers for large-scale batch problems.
However, for non-decomposable loss functions, defining an online learning framework and providing efficient
algorithms with small regret is itself challenging. Rakhlin et al. [2011] propose a generic method for such loss functions;
however, the algorithms proposed therein run in exponential time. Kar et al. [2014] also study such measures with the
aim of designing stochastic gradient-style methods. However, their methods require a large buffer to be maintained,
which causes them to have poorer convergence guarantees and in practice be slower than our methods.

By exploiting the special structure in our function classes, we are able to do away with such requirements. Our 
methods make use of standard online convex optimization primitives [Zinkevich, 2003]. However, their application
requires special care in order to avoid divergent behavior. 


3 Problem Setting 

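Let $\mathcal{X} \subset \mathbb{R}^d$ denote the instance space and $\mathcal{Y} = \{-1, +1\}$ the label space, with some distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$.
Let $p := \Pr[y = +1]$ denote the proportion of positives in the population. Let $\mathcal{T} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_T, y_T)\}$
denote a sample of training points sampled i.i.d. from $\mathcal{D}$. For sake of simplicity we shall present our algorithms and
analyses for a set of linear models $\mathcal{W} \subseteq \mathbb{R}^d$. Let $R_{\mathcal{X}}$ and $R_{\mathcal{W}}$ denote the radii of the domain $\mathcal{X}$ and hypothesis class
$\mathcal{W}$ respectively.

We consider performance measures that can be expressed in terms of the true positive and negative rates of a
classifier. To represent these measures, we shall use the notion of a reward function $r$ that assigns a reward $r(y, \hat{y})$ to
a prediction $\hat{y} \in \mathbb{R}$ when the true label is $y \in \mathcal{Y}$. We will use

$$r^+(\mathbf{w}; \mathbf{x}, y) = \frac{1}{p} \cdot r(y, \mathbf{w}^\top \mathbf{x}) \cdot \mathbb{1}(y = 1)$$

$$r^-(\mathbf{w}; \mathbf{x}, y) = \frac{1}{1-p} \cdot r(y, \mathbf{w}^\top \mathbf{x}) \cdot \mathbb{1}(y = -1)$$

to calculate rewards on positive and negative points. Since $\mathbb{E}_{(\mathbf{x},y)}[r^+(\mathbf{w}; \mathbf{x}, y)] = \mathbb{E}_{(\mathbf{x},y)}[r(y, \mathbf{w}^\top \mathbf{x}) \mid y = 1]$, setting
$r^{0\text{-}1}(y, \hat{y}) = \mathbb{1}(y\hat{y} > 0)$ gives us $\mathbb{E}_{(\mathbf{x},y)}[r^+(\mathbf{w}; \mathbf{x}, y)] = \mathrm{TPR}(\mathbf{w})$. For sake of convenience, we will use
$P(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y)}[r^+(\mathbf{w}; \mathbf{x}, y)]$ and $N(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y)}[r^-(\mathbf{w}; \mathbf{x}, y)]$ to denote population averages of the reward functions. We shall
assume that our reward functions are concave, $L_r$-Lipschitz, and take values in a bounded range $[-B_r, B_r]$.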
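The normalization by $p$ and $1-p$ can be checked on a small sample. The sketch below averages $r^+$ and $r^-$ under the 0-1 reward and recovers the empirical TPR and TNR; the six data points, the fixed model `w` and the helper names are hypothetical, and the population proportion $p$ is replaced by its empirical estimate as a simplifying assumption.

```python
def reward_01(y, score):
    """0-1 reward: 1 if the sign of the score agrees with the label, else 0."""
    return 1.0 if y * score > 0 else 0.0

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def empirical_rates(w, data, reward=reward_01):
    """Average r^+ and r^- over the sample, estimating P(w) and N(w)."""
    n = len(data)
    p = sum(1 for _, y in data if y == 1) / n     # empirical proportion of positives
    P = N = 0.0
    for x, y in data:
        r = reward(y, dot(w, x))
        if y == 1:
            P += r / p                            # r^+ = (1/p) * r * 1(y = +1)
        else:
            N += r / (1.0 - p)                    # r^- = (1/(1-p)) * r * 1(y = -1)
    return P / n, N / n

# Hypothetical sample: 2 positives, 4 negatives, features (z, bias).
data = [((2.0, 1.0), 1), ((-0.5, 1.0), 1),
        ((-1.0, 1.0), -1), ((-2.0, 1.0), -1), ((0.3, 1.0), -1), ((-3.0, 1.0), -1)]
w = (1.0, 0.0)                                    # predicts the sign of z

P_hat, N_hat = empirical_rates(w, data)
print(P_hat, N_hat)   # one of two positives and three of four negatives are correct
```

Replacing `reward_01` by a concave surrogate leaves the averaging unchanged, which is what allows the same quantities to be estimated one point at a time.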


4 Concave Performance Measures 

The first class of performance measures we analyze are concave performance measures. These measures can be written
as concave functions of the TPR and TNR, i.e.

$$\mathcal{P}_\Psi(\mathbf{w}) = \Psi(P(\mathbf{w}), N(\mathbf{w}))$$

for some concave link function $\Psi : \mathbb{R}^2 \to \mathbb{R}$. A large number of popular performance measures fall in this family since
these measures are relevant in situations with severe label imbalance or in situations where cost-sensitive classification
is required, such as detection theory [Vincent, 1994]. Table 1 gives a list of such performance measures along with
some of their relevant properties and references to works that utilize these performance measures.

We shall find it convenient to define the (concave) Fenchel conjugate of the link functions for our performance
measures. For any concave function $\Psi$ and $\alpha, \beta \in \mathbb{R}$, define

$$\Psi^*(\alpha, \beta) = \inf_{u, v \in \mathbb{R}} \{\alpha u + \beta v - \Psi(u, v)\}.$$

By the concavity of $\Psi$, we have, for any $u, v \in \mathbb{R}$,

$$\Psi(u, v) = \inf_{\alpha, \beta \in \mathbb{R}} \{\alpha u + \beta v - \Psi^*(\alpha, \beta)\}.$$

We shall use the notation $\Psi$ to denote both the link function and the performance measure it induces.
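The dual representation can be sanity-checked numerically for H-mean. The sketch below assumes the H-mean entries of Table 1 read as "sufficient dual region $\{\sqrt{\alpha} + \sqrt{\beta} \ge \sqrt{2}\} \cap B(0, 2)$ with $\Psi^* = 0$ on it", and additionally assumes $u, v > 0$, so that the infimum is attained on the boundary curve $\alpha = 2s^2$, $\beta = 2(1-s)^2$, $s \in [0, 1]$. The parametrization, the grid search and the function names are illustrative choices, not part of the original analysis.

```python
def h_mean(u, v):
    return 2.0 * u * v / (u + v)

def h_mean_via_dual(u, v, grid=20001):
    """Minimise alpha*u + beta*v over the curve sqrt(alpha) + sqrt(beta) = sqrt(2)."""
    best = float("inf")
    for k in range(grid):
        s = k / (grid - 1)
        alpha = 2.0 * s * s
        beta = 2.0 * (1.0 - s) ** 2
        # Psi*(alpha, beta) is taken to be 0 on the dual region of H-mean.
        best = min(best, alpha * u + beta * v)
    return best

for u, v in [(0.9, 0.3), (0.5, 0.5), (0.2, 0.95)]:
    print(u, v, h_mean(u, v), h_mean_via_dual(u, v))
```

Setting the derivative in $s$ to zero gives $s = v/(u+v)$ and the value $2uv/(u+v)$, so the linearization is tight exactly at the dual point matched to the current rates.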
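Table 1: List of concave performance measures $\Psi(P, N)$ along with their monotonicity and Lipschitz properties,
sufficient dual regions, and expressions for dual subgradients. $B(0, r)$ denotes the ball of radius $r$ around the origin,
$\mathbb{R}^2_+$ denotes the positive quadrant.

| Name | Expression $(P, N)$ | Mon.? | Lip.? | $\delta(\epsilon)$ | Sufficient dual region $\mathcal{A}_\Psi$ | $\nabla\Psi^*(\alpha, \beta)$ |
|---|---|---|---|---|---|---|
| Min [Vincent, 1994] | $\min\{P, N\}$ | ✓ | ✓ | $\epsilon$ | $\{\alpha + \beta = 1\} \cap \mathbb{R}^2_+$ | $\mathbf{0}$ |
| H-mean [Kennedy et al., 2010] | $\frac{2PN}{P+N}$ | ✓ | ✓ | $4\epsilon$ | $\{\sqrt{\alpha} + \sqrt{\beta} \ge \sqrt{2}\} \cap B(0, 2)$ | $\mathbf{0}$ |
| Q-mean [Liu and Chawla, 2011] | $1 - \sqrt{\frac{(1-P)^2 + (1-N)^2}{2}}$ | ✓ | ✓ | $\epsilon$ | $\{\alpha^2 + \beta^2 \le 1/2\} \cap \mathbb{R}^2_+$ | $\mathbf{1}$ |
| G-mean [Daskalaki et al., 2006] | $\sqrt{PN}$ | ✓ | ✗ | | $\{\alpha\beta \ge 1/4\} \cap \mathbb{R}^2_+$ | $\mathbf{0}$ |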

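Algorithm 1 SPADE: Stochastic PrimAl-Dual mEthod
Input: Primal/dual step sizes $\eta_t, \eta'_t$, feasible sets $\mathcal{W}, \mathcal{A}$
Output: Classifier $\bar{\mathbf{w}} \in \mathcal{W}$
 1: $\mathbf{w}_0 \leftarrow \mathbf{0}$, $t \leftarrow 1$
 2: while data stream has points do
 3:   Receive data point $(\mathbf{x}_t, y_t)$
 4:   /* Perform primal ascent */
 5:   if $y_t > 0$ then
 6:     $\mathbf{w}_{t+1} \leftarrow \Pi_{\mathcal{W}}(\mathbf{w}_t + \eta_t \cdot \alpha_t \nabla_{\mathbf{w}} r^+(\mathbf{w}_t; \mathbf{x}_t, y_t))$
 7:   else
 8:     $\mathbf{w}_{t+1} \leftarrow \Pi_{\mathcal{W}}(\mathbf{w}_t + \eta_t \cdot \beta_t \nabla_{\mathbf{w}} r^-(\mathbf{w}_t; \mathbf{x}_t, y_t))$
 9:   end if
10:   /* Perform dual descent */
11:   $(a, b) \leftarrow (\alpha_t, \beta_t) - \eta'_t \nabla_{(\alpha,\beta)} \Psi^*(\alpha_t, \beta_t)$
12:   if $y_t > 0$ then
13:     $a \leftarrow a - \eta'_t \cdot r^+(\mathbf{w}_t; \mathbf{x}_t, y_t)$
14:   else
15:     $b \leftarrow b - \eta'_t \cdot r^-(\mathbf{w}_t; \mathbf{x}_t, y_t)$
16:   end if
17:   $(\alpha_{t+1}, \beta_{t+1}) \leftarrow \Pi_{\mathcal{A}}((a, b))$
18:   $t \leftarrow t + 1$
19: end while
20: return $\bar{\mathbf{w}} = \frac{1}{t} \sum_{\tau=1}^{t} \mathbf{w}_\tau$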


4.1 A Stochastic Primal-dual Method for Optimizing Concave Performance Measures 


We now present a novel online stochastic method for optimizing the class of concave performance measures. The
use of stochastic gradient techniques for these measures presents specific challenges due to the non-decomposable
nature of these measures, which makes it difficult to obtain cheap, unbiased estimates of the gradient using a single
point. Recent works [Kar et al., 2013, 2014] have tried to resolve this issue by looking at mini-batch methods or by
using a buffer to maintain a sketch of the stream. However, such techniques bring a bias into the learning algorithm
in the form of buffer size or mini-batch length, which results in slower convergence. Indeed, the 1PMB method of
Kar et al. [2014] is only able to guarantee a $1/\sqrt[4]{T}$ rate of convergence, whereas SGD techniques are usually able to
guarantee $1/\sqrt{T}$ rates. This is indicative of suboptimal performance and our experiments confirm this (see Section 6).

Here we show that for the class of concave performance measures, such workarounds are not necessary. To this
end we present the SPADE algorithm (Algorithm 1), which exploits the dual structure of the performance measures
to obtain efficient point updates which do not require the use of mini-batches or online buffers. SPADE is able to
offer convergence guarantees identical to those that stochastic methods offer for additive performance measures such
as least squares, without the presence of any algorithmic bias.

Let $\mathcal{W} \subseteq \mathbb{R}^d$ and $\mathcal{A} \subseteq \mathbb{R}^2$ be convex regions within the model and dual spaces respectively, and let $\Pi_{\mathcal{W}}$ and $\Pi_{\mathcal{A}}$
denote projection operators for these. Table 1 lists the relevant dual regions for the performance measures listed
therein.
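A minimal Python sketch of the point updates in Algorithm 1, instantiated for the Min-TPR/TNR measure, is given below. It rests on several assumptions that are not taken from the paper: a hinge-style concave reward $\min(1, y \cdot \mathbf{w}^\top\mathbf{x})$, a known positive proportion `p`, step sizes $1/\sqrt{t}$ for both primal and dual updates, a Euclidean ball of radius 10 as $\mathcal{W}$, and a synthetic Gaussian stream produced by the hypothetical helper `make_stream`. For Min, the conjugate-gradient term in the dual step is zero and the dual region is the segment $\{\alpha + \beta = 1\} \cap \mathbb{R}^2_+$ (Table 1), which keeps the dual update simple.

```python
import math
import random

def make_stream(n, p, seed=0):
    """Hypothetical toy data: feature z ~ N(2y, 1) plus a constant bias feature."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = 1 if rng.random() < p else -1
        data.append(((rng.gauss(2.0 * y, 1.0), 1.0), y))
    return data

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def hinge_reward(y, score):
    """Concave reward min(1, y*score) and its derivative with respect to the score."""
    margin = y * score
    return (margin, float(y)) if margin < 1.0 else (1.0, 0.0)

def project_ball(w, radius):
    norm = math.sqrt(dot(w, w))
    return w if norm <= radius else [wi * radius / norm for wi in w]

def project_segment(a, b):
    """Projection onto {alpha + beta = 1, alpha >= 0, beta >= 0} (dual region of Min)."""
    alpha = min(1.0, max(0.0, (a - b + 1.0) / 2.0))
    return alpha, 1.0 - alpha

def spade_min(stream, dim, p, radius=10.0):
    w, w_sum = [0.0] * dim, [0.0] * dim
    alpha, beta = 0.5, 0.5
    t = 0
    for x, y in stream:
        t += 1
        eta = 1.0 / math.sqrt(t)                       # assumed primal and dual step size
        r, dr = hinge_reward(y, dot(w, x))             # reward evaluated at the current w
        scale = 1.0 / p if y > 0 else 1.0 / (1.0 - p)  # normalisation inside r^+ / r^-
        weight = alpha if y > 0 else beta
        # Primal ascent on the dual-weighted reward, then projection onto W.
        w = project_ball([wi + eta * weight * scale * dr * xi
                          for wi, xi in zip(w, x)], radius)
        # Dual descent; the conjugate-gradient term vanishes for Min.
        a, b = alpha, beta
        if y > 0:
            a -= eta * scale * r
        else:
            b -= eta * scale * r
        alpha, beta = project_segment(a, b)
        w_sum = [s + wi for s, wi in zip(w_sum, w)]
    return [s / t for s in w_sum]                      # averaged model

def min_tpr_tnr(w, data):
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == -1]
    tpr = sum(1 for x in pos if dot(w, x) > 0) / len(pos)
    tnr = sum(1 for x in neg if dot(w, x) <= 0) / len(neg)
    return min(tpr, tnr)

data = make_stream(20000, p=0.1)
w_bar = spade_min(data, dim=2, p=0.1)
print("min(TPR, TNR) of the averaged model:", min_tpr_tnr(w_bar, data))
```

Each update touches a single data point and two dual scalars, so no buffer or mini-batch is needed; the dual weights shift toward whichever class currently earns the lower reward.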
4.2 Convergence Analysis for SPADE

This section presents a convergence analysis for the SPADE algorithm. The convergence result is formally stated in
Theorem 4. Apart from demonstrating the utility of the algorithm, the proof also sheds light on the choice of algorithm
parameters, such as primal/dual feasible regions.

We shall work with performance measures that are monotonically increasing in the true positive and negative rates
of the classifier, i.e. if $u \ge u'$, $v \ge v'$ then $\Psi(u, v) \ge \Psi(u', v')$. This is a natural assumption and is satisfied by all
performance measures considered here (see Table 1). We now introduce two useful concepts.

Definition 1 (Stable Performance Measure). A performance measure $\Psi$ will be called $\delta$-stable if for some function
$\delta : \mathbb{R} \to \mathbb{R}$, we have for all $u, v \in \mathbb{R}$ and $\epsilon \in \mathbb{R}_+$,

$$\Psi(u + \epsilon, v + \epsilon) \le \Psi(u, v) + \delta(\epsilon).$$

Table 1 lists the stability parameters of all the concave performance measures. Clearly, a performance measure has
a linear stability parameter, i.e. $\delta(\epsilon) \le L \cdot \epsilon$, iff its corresponding link function is Lipschitz. We now define the notion
of a sufficient dual region for a performance measure.

Definition 2 (Sufficient Dual Region). For any link function $\Psi$, define its sufficient dual region $\mathcal{A}_\Psi \subseteq \mathbb{R}^2$ to be the
minimal set such that for all $(u, v) \in \mathbb{R}^2$, we have

$$\Psi(u, v) = \inf_{(\alpha, \beta) \in \mathcal{A}_\Psi} \{\alpha u + \beta v - \Psi^*(\alpha, \beta)\}.$$

The reason for defining this quantity will become clear in a moment. A closer look at Algorithm 1 indicates
that it is performing online gradient descent steps with the dual variables. Clearly, for this procedure to have statistical
convergence properties, the magnitude of the updates must be bounded in some sense, otherwise the learning procedure
may diverge. This motivates the projection step in Step 17. However, in order for the updated dual variables to be
informative about the current primal function value, the projection step must be done in a way that does not distort the
link function. The notion of a sufficient dual region formally captures the notion of such a projection step.
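To make the projection in Step 17 concrete, the sketch below implements Euclidean projections onto two of the dual regions, reading the Table 1 entries for Q-mean and Min as $\{\alpha^2 + \beta^2 \le 1/2\} \cap \mathbb{R}^2_+$ and $\{\alpha + \beta = 1\} \cap \mathbb{R}^2_+$ respectively. The function names are hypothetical. The Q-mean projection uses the fact that, for the intersection of the nonnegative quadrant with a ball centered at the origin, clipping to the quadrant and then rescaling into the ball yields the exact projection.

```python
import math

Q_RADIUS = math.sqrt(0.5)   # radius of the ball in the assumed Q-mean dual region

def project_q_mean_region(a, b):
    """Project (a, b) onto {alpha^2 + beta^2 <= 1/2} intersected with the positive quadrant."""
    a, b = max(a, 0.0), max(b, 0.0)          # projection onto the quadrant
    norm = math.hypot(a, b)
    if norm > Q_RADIUS:                      # rescale into the ball if needed
        a, b = a * Q_RADIUS / norm, b * Q_RADIUS / norm
    return a, b

def project_min_region(a, b):
    """Project (a, b) onto the segment {alpha + beta = 1, alpha >= 0, beta >= 0}."""
    alpha = min(1.0, max(0.0, (a - b + 1.0) / 2.0))
    return alpha, 1.0 - alpha

print(project_q_mean_region(1.0, 1.0))
print(project_q_mean_region(-1.0, 2.0))
print(project_min_region(0.9, 0.4))
```

Both regions are bounded, so the projected dual iterates stay in a compact set regardless of how large an individual reward-driven step is.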

Having said that, there is no a priori guarantee that the sufficient region for a given performance measure would be
bounded, in which case this entire exercise counts for naught. However, the following lemma, by closely linking the
stability properties of a performance measure with the size of its sufficient dual region, shows that for well-behaved
link functions, this will not be the case.

Lemma 3. The stability parameter of a performance measure $\Psi(\cdot)$ can be written as $\delta(\epsilon) \le L_\Psi \cdot \epsilon$ iff its sufficient
dual region is bounded in a ball of radius $L_\Psi$.

The proof of this result follows from elementary manipulations and can be found in the appendix. In some sense
this result can be seen as a realization of the well known connection between the Fenchel dual of a function and its
Lipschitz properties.

To simplify the initial analysis, we shall first concentrate on performance measures whose link functions are
Lipschitz. It is easy to see that these are exactly the performance measures whose gradients do not diverge within any
compact region of the real plane. Of the performance measures listed in Table 1, all measures except G-mean have
associated link functions that are Lipschitz. Subsequently, we shall address the more involved case of non-Lipschitz
performance measures such as G-mean as well.

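Theorem 4. Suppose we are given a stream of random samples $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_T, y_T)$ drawn from a distribution $\mathcal{D}$
over $\mathcal{X} \times \mathcal{Y}$. Let $\Psi(\cdot)$ be a concave, Lipschitz link function. Let Algorithm 1 be executed with a dual feasible set
$\mathcal{A} \supseteq \mathcal{A}_\Psi$, $\eta_t = 1/\sqrt{t}$ and $\eta'_t = 1/\sqrt{t}$. Then, the average model $\bar{\mathbf{w}} = \frac{1}{T}\sum_{t=1}^{T} \mathbf{w}_t$ output by the algorithm satisfies,
with probability at least $1 - \delta$,

$$\mathcal{P}_\Psi(\bar{\mathbf{w}}) \ge \sup_{\mathbf{w}^* \in \mathcal{W}} \mathcal{P}_\Psi(\mathbf{w}^*) - O\left(\delta_\Psi\left(\sqrt{\frac{1}{T}\log\frac{1}{\delta}}\right)\right).$$

We refer the reader to Appendix B for a proof and explicit constants. The proof closely analyzes the primal ascent
and dual descent steps, tying them together using the Fenchel dual of $\Psi$.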
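Table 2: List of pseudo-linear performance measures $\mathcal{P}_{(\mathbf{a},\mathbf{b})}(P, N)$ along with their popular forms, canonical
expressions in terms of (reward functions representative of) true positive ($P$) and negative ($N$) rates, monotonicity properties,
acceptable range of reward values, and rate of convergence of the Alt-Max procedure for the performance measure
when rewards take values in the range $[0, m)$.

| Name | Popular Form | Canonical Form $(P, N)$ | Mon.? | Range $P, N$ | Rate $\eta(m)$ |
|---|---|---|---|---|---|
| $F_\beta$-measure [Manning et al.] | $\frac{(1+\beta^2) \cdot TP}{(1+\beta^2) \cdot TP + \beta^2 \cdot FN + FP}$ | $\frac{(1+\beta^2) \cdot P}{\beta^2 + \theta + P - \theta \cdot N}$ | ✓ | $[0, 1 + \frac{\beta^2}{\theta})$ | $\frac{m(1+\theta)}{m + \beta^2 + \theta}$ |
| Jaccard Coefficient [Koyejo et al., 2014] | $\frac{TP}{TP + FP + FN}$ | $\frac{P}{1 + \theta - \theta \cdot N}$ | ✓ | $[0, \frac{1+\theta}{\theta})$ | $\frac{m\theta}{1+\theta}$ |
| Gower-Legendre, $\sigma < 1$ [Sokolova and Lapalme] | $\frac{TP + TN}{TP + \sigma(FP + FN) + TN}$ | $\frac{P + \theta \cdot N}{\sigma(1+\theta) + (1-\sigma) \cdot P + \theta(1-\sigma) \cdot N}$ | ✓ | $[0, \infty)$ | $\frac{(1-\sigma)m}{(1-\sigma)m + \sigma}$ |
| Gower-Legendre, $\sigma > 1$ [Sokolova and Lapalme] | $\frac{TP + TN}{TP + \sigma(FP + FN) + TN}$ | $\frac{P + \theta \cdot N}{\sigma(1+\theta) + (1-\sigma) \cdot P + \theta(1-\sigma) \cdot N}$ | ✓ | $[0, \frac{\sigma}{\sigma-1})$ | $\frac{(\sigma-1)m}{\sigma}$ |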

4.3 The Case of non-Lipschitz Link Functions 

Non-Lipschitz link functions, such as the one used in the G-mean performance measure, pose a particular challenge
to the previous analysis. Owing to their non-Lipschitz nature, their sufficient dual region is unbounded. Indeed, as
Table 1 indicates, the sufficient region for $\Psi_{\text{G-mean}}$ extends indefinitely along both coordinate axes. More precisely,
the gradients of the $\Psi_{\text{G-mean}}$ function diverge as either $u \to 0$ or $v \to 0$. This poses a stumbling
block for the proof of Theorem 4, since the regret and online-to-batch conversion results used therein fail.

A natural way to solve this problem is to ensure that the reward functions $r^+, r^-$ always assign rewards that are
bounded away from zero. More specifically, for some $\epsilon > 0$, we have $r^+(\mathbf{w}; \mathbf{x}, y), r^-(\mathbf{w}; \mathbf{x}, y) \ge \epsilon$ for all $\mathbf{w} \in \mathcal{W}$
and $\mathbf{x} \in \mathcal{X}$. For this restricted reward region, one can show, using Lemma 3, that the sufficient dual region can be
restricted to a ball of bounded radius that depends on $\epsilon$.
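The divergence, and the effect of bounding rewards away from zero, can be seen directly from the gradient of $\sqrt{uv}$. The sketch below is illustrative only: the function names are hypothetical, and the bound in `grad_norm_bound` is derived here under the assumption that both arguments lie in $[\epsilon, B]$; it is not the radius stated in the paper.

```python
import math

def g_mean_grad(u, v):
    """Gradient of sqrt(u * v) for u, v > 0."""
    return 0.5 * math.sqrt(v / u), 0.5 * math.sqrt(u / v)

def grad_norm_bound(eps, upper):
    """Upper bound on the gradient norm when eps <= u, v <= upper.

    Each component is at most 0.5 * sqrt(upper / eps), hence the sqrt(2) factor.
    """
    return math.sqrt(2.0) * 0.5 * math.sqrt(upper / eps)

# The first gradient component grows without bound as u approaches 0.
for u in (1e-2, 1e-4, 1e-6):
    print(u, g_mean_grad(u, 1.0)[0])

# With rewards kept in [eps, 1], the gradient norm stays below the bound.
eps = 0.01
print(grad_norm_bound(eps, 1.0))
```

Keeping rewards above $\epsilon$ therefore trades an unbounded dual region for one whose size grows only as $\epsilon$ shrinks.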

The above discussion suggests that we regularize the reward function, i.e. at each time step $t$, we add a small
value $\epsilon(t)$ to the original reward function. However, the amount of regularization remains to be decided, since
over-regularization could cause our resulting excess risk bound to be vacuous with respect to the original reward function.
It turns out that letting $\epsilon(t)$ decay with $t$ at a suitable rate strikes a fine balance between regularization and fidelity to
the original reward function. This seems intuitive since the regularization becomes milder and milder as learning
progresses. The following extension of Theorem 4 formalizes this statement.

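Theorem 5. Suppose we have the problem setting in Theorem 4 with the $\Psi_{\text{G-mean}}$ performance measure being optimized
for. Consider a modification to Algorithm 1 wherein the reward functions are changed to $r_t^+(\cdot) = r^+(\cdot) + \epsilon(t)$ and
$r_t^-(\cdot) = r^-(\cdot) + \epsilon(t)$ for a decaying regularization sequence $\epsilon(t)$. Then, the average model $\bar{\mathbf{w}} = \frac{1}{T}\sum_{t=1}^{T} \mathbf{w}_t$ output by
the algorithm satisfies, with probability at least $1 - \delta$,

$$\mathcal{P}_{\Psi_{\text{G-mean}}}(\bar{\mathbf{w}}) \ge \sup_{\mathbf{w}^* \in \mathcal{W}} \mathcal{P}_{\Psi_{\text{G-mean}}}(\mathbf{w}^*) - O(\cdot).$$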

The proof of this theorem can be found in the appendix. We note here that primal dual frameworks have been
utilized before in diverse areas such as distributed optimization [Jaggi et al., 2014] and multi-objective optimization
[Mahdavi et al., 2013]. However, these works simply assume the functions involved therein to be Lipschitz and/or
smooth and do not address cases where they fail to be so. Theorem 5, on the other hand, is able to recover a non-trivial,
albeit weaker, statement even for locally Lipschitz functions.


5 Pseudo-linear Performance Measures 


The second class of performance measures we analyze are pseudo-linear performance measures. These measures have
a fractional linear function as the link function and can be written as follows:

$$\mathcal{P}_{(\mathbf{a},\mathbf{b})}(\mathbf{w}) = \frac{a_0 + a_1 \cdot P(\mathbf{w}) + a_2 \cdot N(\mathbf{w})}{b_0 + b_1 \cdot P(\mathbf{w}) + b_2 \cdot N(\mathbf{w})}$$
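Algorithm 2 AMP: Alternate Maximization Procedure
Input: Performance measure $\mathcal{P}_{(\mathbf{a},\mathbf{b})}$, feasible set $\mathcal{W}$, tolerance $\epsilon$
Output: An $\epsilon$-optimal classifier $\mathbf{w} \in \mathcal{W}$
1: Construct valuation function $V_{(\mathbf{a},\mathbf{b})}$
2: $\mathbf{w}_0 \leftarrow \mathbf{0}$, $t \leftarrow 1$
3: while $v_t > v_{t-1} + \epsilon$ do
4:   $\mathbf{w}_{t+1} \leftarrow \arg\max_{\mathbf{w} \in \mathcal{W}} V_{(\mathbf{a},\mathbf{b})}(\mathbf{w}, v_t)$
5:   $v_{t+1} \leftarrow \arg\max_{v > 0} v$ such that $V_{(\mathbf{a},\mathbf{b})}(\mathbf{w}_{t+1}, v) \ge v$
6:   $t \leftarrow t + 1$
7: end while
8: return $\mathbf{w}_t$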

for some weighting coefficients $\mathbf{a}, \mathbf{b}$. Several popularly used performance measures, most notably the F-measure, can
be represented as pseudo-linear functions. Table 2 enumerates some popular pseudo-linear performance measures as
well as their properties.

We note that these performance measures are usually represented in literature using the entries of the confusion
matrix. However, for the sake of our analysis, we shall find it useful to represent them in terms of the true positive and
true negative rates. To do so, we shall use $p$ to denote the proportion of positives in the population and $\theta = \frac{1-p}{p}$ to
denote the label skew.
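The passage from confusion-matrix entries to rates can be verified numerically. The sketch below takes the label skew to be $\theta = (1-p)/p$ and the canonical forms to be $\frac{(1+\beta^2)P}{\beta^2 + \theta + P - \theta N}$ for $F_\beta$ and $\frac{P}{1 + \theta - \theta N}$ for the Jaccard coefficient, as read from Table 2; the function names and the test values of $p$, TPR and TNR are hypothetical. It uses $TP = p \cdot \mathrm{TPR}$, $FN = p(1 - \mathrm{TPR})$ and $FP = (1-p)(1 - \mathrm{TNR})$.

```python
def f_beta_confusion(tp, fp, fn, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def f_beta_canonical(P, N, p, beta=1.0):
    theta = (1.0 - p) / p                    # label skew
    b2 = beta * beta
    return (1 + b2) * P / (b2 + theta + P - theta * N)

def jaccard_confusion(tp, fp, fn):
    return tp / (tp + fp + fn)

def jaccard_canonical(P, N, p):
    theta = (1.0 - p) / p
    return P / (1.0 + theta - theta * N)

def confusion_from_rates(p, tpr, tnr):
    """Joint probabilities (TP, FP, FN) implied by the class prior and the rates."""
    return p * tpr, (1.0 - p) * (1.0 - tnr), p * (1.0 - tpr)

p, tpr, tnr = 0.1, 0.8, 0.95
tp, fp, fn = confusion_from_rates(p, tpr, tnr)
print(f_beta_confusion(tp, fp, fn), f_beta_canonical(tpr, tnr, p))
print(jaccard_confusion(tp, fp, fn), jaccard_canonical(tpr, tnr, p))
```

Dividing numerator and denominator of the popular form by $p$ is all that is needed to obtain the canonical form, which is why the skew $\theta$ appears only in the denominator.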


5.1 Alternate-maximization for Optimizing Pseudo-linear Performance Measures 


Pseudo-linear functions are named so since their level sets can be defined using linear half-spaces. More specifically,
every pseudo-linear function $\Psi$ over $\mathbb{R}^2$ has associated "level-finder" functions $\mathbf{a} : \mathbb{R} \to \mathbb{R}^2$ and $b : \mathbb{R} \to \mathbb{R}$ such
that $\Psi(\mathbf{v}) \ge t$ iff $\langle \mathbf{v}, \mathbf{a}(t) \rangle \ge b(t)$. We refer the reader to Parambath et al. [2014] for a more relaxed introduction
to these functions and their properties. For our purposes, however, it suffices to notice that this property immediately
points toward a cost-sensitive method to optimize these performance measures.

This fact was noticed by Parambath et al. [2014], who exploited it to develop a cost-sensitive classification
method for optimizing the F-measure by simply searching for the best weights with which to perform cost-sensitive
classification. However, we notice that instead of performing such a brute force search, one can adaptively tune the
weights to better and better values and obtain much faster convergence. To develop this intuition, we first define the
notion of a valuation function below.


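Definition 6 (Valuation Function). The valuation function of a performance measure $\mathcal{P}_{(\mathbf{a},\mathbf{b})}$ for a classifier $\mathbf{w} \in \mathcal{W}$,
and at a level $v \in \mathbb{R}$, is defined as

$$V_{(\mathbf{a},\mathbf{b})}(\mathbf{w}, v) := c + (\alpha - \gamma v) \cdot P(\mathbf{w}) + (\beta - \delta v) \cdot N(\mathbf{w}),$$

where $c = \frac{a_0}{b_0}$, $\alpha = \frac{a_1}{b_0}$, $\beta = \frac{a_2}{b_0}$, $\gamma = \frac{b_1}{b_0}$, $\delta = \frac{b_2}{b_0}$.

The following well-known lemma closely links the valuation function to the performance measure.

Lemma 7. For any performance measure $\mathcal{P}_{(\mathbf{a},\mathbf{b})}$, $\mathbf{w} \in \mathcal{W}$ and $v \in \mathbb{R}$, we have $\mathcal{P}_{(\mathbf{a},\mathbf{b})}(\mathbf{w}) \ge v \Leftrightarrow V_{(\mathbf{a},\mathbf{b})}(\mathbf{w}, v) \ge v$.
Moreover, in such a situation we say that classifier $\mathbf{w}$ has achieved valuation at level $v$.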

Lemma 7 indicates that the performance of a classifier is intimately linked to its valuation. This suggests a natural
alternate maximization approach wherein we alternate between posing a challenge level to the classifier and training
a classifier to achieve that level. The resulting algorithm AMP is detailed in Algorithm 2. Note that using Lemma 7,
step 5 in the algorithm can be executed simply by setting $v_{t+1} = \mathcal{P}_{(\mathbf{a},\mathbf{b})}(\mathbf{w}_{t+1})$. Thus, in a very natural manner, the
current classifier challenges the next classifier to beat its own performance. It turns out that this approach results in
rapid convergence as outlined in the following theorem.

Theorem 8. Let Algorithm 2 be executed with a performance measure $\mathcal{P}_{(\mathbf{a},\mathbf{b})}$ and reward functions that offer values
in the range $[0, m)$. Let $\mathcal{P}^* := \sup_{\mathbf{w} \in \mathcal{W}} \mathcal{P}_{(\mathbf{a},\mathbf{b})}(\mathbf{w})$. Also let $\Delta_t = \mathcal{P}^* - \mathcal{P}_{(\mathbf{a},\mathbf{b})}(\mathbf{w}_t)$ be the excess error for the model
$\mathbf{w}_t$ generated at time $t$. Then there exists a value $\eta(m) < 1$ such that $\Delta_t \le \Delta_0 \cdot \eta(m)^t$.
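The alternation in Algorithm 2 can be traced on a toy problem where the inner maximization is exact. In the sketch below the hypothesis class is a finite set of threshold rules on a one-dimensional feature, the rewards are 0-1 rewards, the measure is $F_1$, and the valuation is maximized by enumeration; all of these, along with the data and function names, are simplifying assumptions made for illustration rather than the setting analyzed in Theorem 8.

```python
def rates_at(thr, data):
    """TPR and TNR of the rule 'predict +1 iff z > thr' on (z, y) pairs."""
    pos = [z for z, y in data if y == 1]
    neg = [z for z, y in data if y == -1]
    tpr = sum(1 for z in pos if z > thr) / len(pos)
    tnr = sum(1 for z in neg if z <= thr) / len(neg)
    return tpr, tnr

def f1_from_rates(P, N, theta):
    return 2.0 * P / (2.0 + theta + P - theta * N)

def f1_valuation(P, N, v, theta):
    # Valuation of F1: a = (0, 2, 0), b = (2 + theta, 1, -theta), normalised by b0.
    return ((2.0 - v) * P + theta * v * N) / (2.0 + theta)

def amp_f1(data, thresholds, tol=0.0):
    p = sum(1 for _, y in data if y == 1) / len(data)
    theta = (1.0 - p) / p
    stats = {thr: rates_at(thr, data) for thr in thresholds}
    w = thresholds[0]
    v = f1_from_rates(*stats[w], theta)
    while True:
        # Train the next classifier to maximise the valuation at level v.
        w_next = max(thresholds, key=lambda thr: f1_valuation(*stats[thr], v, theta))
        # New challenge level: the performance of the new classifier (Lemma 7).
        v_next = f1_from_rates(*stats[w_next], theta)
        if v_next <= v + tol:
            return w, v
        w, v = w_next, v_next

data = [(-2.0, -1), (-1.5, -1), (-1.0, -1), (-0.5, 1), (0.0, -1),
        (0.5, -1), (1.0, 1), (1.5, -1), (2.0, 1), (2.5, 1)]
thresholds = [z - 0.25 for z, _ in data] + [3.0]
best_thr, best_f1 = amp_f1(data, thresholds)
print(best_thr, best_f1)
```

Because each round either strictly raises the challenge level or stops, and a level that is not yet optimal always admits a classifier whose valuation exceeds it, the loop ends at the best $F_1$ attainable in the class.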
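Algorithm 3 STAMP: STochastic Alt-Max Procedure
Input: Feasible set $\mathcal{W}$, step sizes $\eta_t$, epoch lengths $s_e, s'_e$
Output: Classifier $\mathbf{w} \in \mathcal{W}$
 1: $v_0 \leftarrow 0$, $t \leftarrow 0$, $e \leftarrow 0$, $\mathbf{w}_0 \leftarrow \mathbf{0}$
 2: repeat
 3:   /* Model optimization stage */
 4:   $\mathbf{w} \leftarrow \mathbf{w}_e$
 5:   while $t < s_e$ do
 6:     Receive sample $(\mathbf{x}, y)$
 7:     $\mathbf{w} \leftarrow \mathbf{w} + \eta_t \nabla_{\mathbf{w}}\left(\left(1 - \frac{v_e}{2}\right) r^+(\mathbf{w}; \mathbf{x}, y) + \frac{\theta v_e}{2} \cdot r^-(\mathbf{w}; \mathbf{x}, y)\right)$
 8:     $t \leftarrow t + 1$
 9:   end while
10:   $t \leftarrow 0$, $e \leftarrow e + 1$, $\mathbf{w}_{e+1} \leftarrow \mathbf{w}$
11:   /* Challenge level estimation stage */
12:   $v_+ \leftarrow 0$, $v_- \leftarrow 0$
13:   while $t < s'_e$ do
14:     Receive sample $(\mathbf{x}, y)$
15:     $v_y \leftarrow v_y + r^y(\mathbf{w}_e; \mathbf{x}, y)$, where the index $y$ stands for $+$ or $-$ according to the label
16:     $t \leftarrow t + 1$
17:   end while
18:   $t \leftarrow 0$, $v_e \leftarrow F_1$ evaluated at the averaged rewards $(v_+ / s'_e,\ v_- / s'_e)$
19: until stream is exhausted
20: return $\mathbf{w}_e$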


The proof of this theorem can be found in Appendix D. Table 2 gives values for the convergence rates of all the
pseudo-linear performance measures, as well as the allowed range of values that the reward functions can take for those
measures. This is important since performance measures such as the F-measure diverge if the reward function values
approach 2. Other performance measures like the Gower-Legendre measure do not impose any such restrictions. Note
that the above result shows that Algorithm 2 will always terminate in $O(\log\frac{1}{\epsilon})$ steps.

At this point it would be apt to make a historical note. Pseudo-linear functions have enjoyed a fair amount of
interest in the optimization community [Schaible, 1976, Dinkelbach, 1967, Jagannathan, 1966] within the sub-field of
fractional programming. Of the many methods that have been developed to optimize these functions, the
Dinkelbach-Jagannathan (DJ) procedure [Dinkelbach, 1967, Jagannathan, 1966] is of specific interest to us. It turns out that the
AMP method can be seen as performing DJ-style updates over parameterized spaces (the parameter being the model
$\mathbf{w}$). It is known (for instance see Schaible [1976]) that the DJ process is able to offer a linear convergence rate. Our
proof of Theorem 8, which was obtained independently, can then be seen as giving a similar result in the parameterized
setting.

However, we wish to move one step further and optimize these performance measures in an online stochastic
manner. To this end, we observe that the AMP algorithm can be executed in an online fashion by using stochastic
updates to train the intermediate models. The resulting algorithm, STAMP, is presented in Algorithm 3. However, this
algorithm is much harder to analyze because, unlike AMP, which has the luxury of offering exact updates, STAMP
offers inexact, even noisy updates. Indeed, even existing works in the optimization community (for example Schaible
[1976]) do not seem to have analyzed DJ-style methods with noisy updates.
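A rough Python sketch of the two alternating stages for the $F_1$ measure follows. It is not the authors' implementation and makes several assumptions: a hinge-style reward $\min(1, y \cdot \mathbf{w}^\top\mathbf{x})$ in the model stage, model-stage weights $(1 - v/2)$ and $\theta v / 2$ obtained from the $F_1$ valuation function, a challenge level set to the empirical 0-1 $F_1$ on the next block of samples, equal epoch lengths that start at 100 and double, a step size of $0.1/\sqrt{t}$ within each epoch, no projection step, and a synthetic stream from the hypothetical helper `make_stream`.

```python
import itertools
import math
import random

def make_stream(n, p, seed=0):
    """Hypothetical toy data: feature z ~ N(2y, 1) plus a constant bias feature."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = 1 if rng.random() < p else -1
        data.append(((rng.gauss(2.0 * y, 1.0), 1.0), y))
    return data

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def f1_of(w, data):
    """Empirical F1 of the linear rule sign(w . x) on a list of (x, y) pairs."""
    tp = fp = fn = 0
    for x, y in data:
        pred = 1 if dot(w, x) > 0 else -1
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1:
            fp += 1
        elif y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0

def stamp_f1(stream, dim, p, first_epoch=100, step=0.1):
    theta = (1.0 - p) / p
    w, v = [0.0] * dim, 0.0
    it = iter(stream)
    epoch_len = first_epoch
    while True:
        # Model optimization stage: stochastic ascent on the valuation at level v.
        batch = list(itertools.islice(it, epoch_len))
        if not batch:
            return w, v
        for t, (x, y) in enumerate(batch, start=1):
            if y * dot(w, x) < 1.0:          # hinge reward has a non-zero gradient here
                coef = (1.0 - v / 2.0) / p if y > 0 else (theta * v / 2.0) / (1.0 - p)
                eta = step / math.sqrt(t)
                w = [wi + eta * coef * y * xi for wi, xi in zip(w, x)]
        # Challenge level estimation stage: measure the current model on fresh samples.
        batch = list(itertools.islice(it, epoch_len))
        if not batch:
            return w, v
        v = f1_of(w, batch)
        epoch_len *= 2

data = make_stream(20000, p=0.1)
w_final, v_final = stamp_f1(data, dim=2, p=0.1)
print("last challenge level:", v_final, "F1 on the stream:", f1_of(w_final, data))
```

Both stages consume each sample once and keep only the model and a few counters, so the procedure needs no buffer; the growing epoch lengths reduce the noise in later challenge levels.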

Our next contribution, hence, is an analysis of the convergence rate offered by the AMP algorithm when neither
of the two maximizations is carried out exactly. For the sake of simplicity, we present the STAMP algorithm and its
analysis for the case of the $F_1$ measure. Suppose at each time step, for some $\epsilon_t > 0$ and $\delta_t$, we have

$$V(\mathbf{w}_{t+1}, v_t) = \max_{\mathbf{w} \in \mathcal{W}} V(\mathbf{w}, v_t) - \epsilon_t$$

$$v_t = F(\mathbf{w}_t) + \delta_t,$$
Figure 1: Comparison of stochastic primal-dual method (SPADE) with baseline methods on Q-mean maximization
tasks. SPADE achieves similar/better accuracies while consistently requiring about 3-4x less time than other baseline
approaches.

Figure 2: Comparison of stochastic primal-dual method (SPADE) with baseline methods on Min-TPR/TNR
maximization tasks


then for some $\eta < 1$, we have

$$\Delta_T \le \eta^T \Delta_0 + \sum_{i=0}^{T-1} \eta^{T-i} \left(|\delta_i| + \epsilon_i\right).$$

As a corollary we present a convergence analysis for the STAMP algorithm in Theorem 9.

Theorem 9. Let Algorithm 3 be executed with a performance measure $\mathcal{P}_{(\mathbf{a},\mathbf{b})}$ and reward functions with range $[0, m)$.
Let $\eta = \eta(m)$ be the rate of convergence guaranteed for $\mathcal{P}_{(\mathbf{a},\mathbf{b})}$ by the AMP algorithm. Set the epoch lengths to
$s_e, s'_e = \tilde{O}\left(\frac{1}{\eta^{2e}}\right)$. Then after $e = \log_{\frac{1}{\eta}}\left(\frac{1}{\epsilon}\log^2\frac{1}{\epsilon}\right)$ epochs, we can ensure with probability at least $1 - \delta$ that
$\mathcal{P}^* - \mathcal{P}_{(\mathbf{a},\mathbf{b})}(\mathbf{w}_e) \le \epsilon$. Moreover the number of samples consumed till this point is at most $\tilde{O}\left(\frac{1}{\epsilon^2}\right)$.

The convergence analysis for noisy AMP and the proof of this theorem can both be found in the appendix.
Both results require a fine grained analysis of how errors accumulate throughout the learning process.


6 Experimental Results 

We shall now compare our methods with the state-of-the-art on various performance measures and datasets. 

Datasets: We evaluated our methods on 5 publicly available benchmark datasets: a) PPI, b) KDD Cup 2008, c) 
IJCNN, d) Covertype, e) MNIST. All datasets exhibited moderate to severe label imbalance with the KDD Cup 2008 
dataset having just 0.61% positives. 

Methods: We instantiated the SPADE algorithm (Algorithm 1) on the Q-mean and Min-TPR/TNR performance
measures. We also instantiated the STAMP method (Algorithm 3) on F1-measure and the JAC coefficient. In both
Figure 3: Comparison of stochastic alternate maximization procedure (STAMP) with baseline methods on F1
maximization tasks

Figure 4: Comparison of stochastic alternate maximization procedure (STAMP) with baseline methods on JAC
maximization tasks


cases we compared to the SVMPerf method Joachims et al. 1 2009| and plug-in method Koyejo et al. |2014| special¬ 
ized to these measures. For the sake of reference, we also compared to the standard logistic regression method for 
(unweighted) binary classification. Additionally for FI-measure, we also compared to the IPMB stochastic gradient 
descent method proposed recently by Kar et al. | 2014| . All methods were implemented in C. 
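The four measures used in these experiments can all be written as functions of the true positive rate (TPR) and true negative rate (TNR). The following is a minimal sketch of those functions, not the authors' C implementation. The function names and the argument `p` (the fraction of positives) are illustrative assumptions, and the Q-mean formula is the commonly used definition, assumed here because this section does not restate it. The F1 and JAC expressions follow from rewriting the usual confusion-matrix definitions in terms of TPR, TNR and `p`.

```python
import math


def q_mean(tpr, tnr):
    # Commonly used Q-mean definition (assumed): one minus the RMS of the per-class error rates.
    return 1.0 - math.sqrt(((1.0 - tpr) ** 2 + (1.0 - tnr) ** 2) / 2.0)


def min_tpr_tnr(tpr, tnr):
    # Min-TPR/TNR: performance on the worse of the two classes.
    return min(tpr, tnr)


def f1_measure(tpr, tnr, p):
    # F1 = 2TP / (2TP + FP + FN) with TP = p*tpr, FN = p*(1-tpr), FP = (1-p)*(1-tnr).
    n = 1.0 - p
    return 2.0 * p * tpr / (1.0 + p * tpr - n * tnr)


def jaccard(tpr, tnr, p):
    # JAC = TP / (TP + FP + FN), rewritten with the same substitutions.
    n = 1.0 - p
    return p * tpr / (1.0 - n * tnr)
```

With `p = 0.5` the F1 expression reduces to `2*tpr / (2 + tpr - tnr)`, which is the form used in the appendix analysis of AMP.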

Parameters: We used 70% of the dataset for training and the rest for testing. Tunable parameters, including 
thresholds for the plug-in approaches, were cross-validated on a validation set. All results reported here were averaged 
over 5 random train-test splits. We used hinge-loss based reward functions for our methods. STAMP was executed 
by setting the challenge level to the actual F-measure/JAC at each stage. We used a state of the art LBFGS solver 
to implement the plug-in methods and used standard implementations of the SVMPerf algorithm. Since our methods 
are able to take a single pass over the data very rapidly, SPADE was allowed to run for 25 passes over the data and 
STAMP was allowed 25 passes with an initial epoch length of 100 which was doubled after every iteration. The 
SVMPerf algorithm was allowed a runtime of up to 50× that given to our method, after which it was terminated. 
The LBFGS solver was always allowed to run till convergence. 
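The epoch schedule described for STAMP (initial epoch length 100, doubled each time, within a budget of 25 passes) can be sketched as follows. This is an illustrative helper, not the authors' code: the function name, its parameters, and in particular the choice to truncate the final epoch when the sample budget runs out are assumptions.

```python
def doubling_epochs(num_samples, passes=25, initial_length=100):
    """Return the list of epoch lengths used until the sample budget is spent."""
    budget = passes * num_samples  # total number of samples that may be consumed
    lengths = []
    length = initial_length
    while budget > 0:
        take = min(length, budget)  # assumed: the last epoch is truncated to fit
        lengths.append(take)
        budget -= take
        length *= 2  # epoch length doubles after every epoch
    return lengths
```

Because the lengths grow geometrically, the number of epochs is only logarithmic in the total number of samples consumed.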

Figures 1 and 2 compare the SPADE method with the baseline methods for the Q-mean and Min-TPR/TNR measures. In general, SPADE was found to offer comparable or superior accuracies with greatly accelerated convergence as compared to other methods. On the IJCNN and Covertype datasets, SPADE outperformed every other method by about 2-3%. As SPADE is a stochastic first order method, it is expected to rapidly find a fairly accurate solution. Indeed, the method was found to offer greatly accelerated convergence without fail. For instance, on the MNIST dataset, SPADE found the best solution as much as 60× faster than any other method whereas on the KDD Cup and PPI datasets it was 12× and 2× faster respectively. The SVMPerf method, on the other hand, was found to be extremely slow in general and to require at least an order of magnitude more time than SPADE to find reasonably accurate solutions. It is also notable that in all cases, simple binary classification gave very poor accuracies due to the severe label imbalance in these datasets. 

Figures 3 and 4 report the performance of the STAMP method applied to pseudo-linear functions. Similar to the concave measures, STAMP was found to provide competitive accuracies as compared to the baseline methods but require at least 3-4× less computational time. Interestingly, for the F1-measure, the 1PMB method, which is another stochastic gradient descent-based method, was found to struggle to obtain accuracies similar to that of STAMP or else offer much slower convergence. We suspect two main reasons for the suboptimal behavior of this other stochastic method. Firstly, these results confirm the adverse effect of the dependence on an in-memory buffer on these methods. It is notable that this dependence causes even the theoretical convergence rates for these methods to be weaker, as was noted earlier in the discussion. Secondly, we note that both SVMPerf and 1PMB optimize the same "struct-SVM" style surrogate for the F-measure Kar et al. [2014]. This surrogate has been observed to give poor accuracies when compared to plug-in methods in several previous works Koyejo et al. [2014], Narasimhan et al. [2014]. STAMP, on the other hand, works directly with F-measure in a manner similar to, but faster than, the plug-in methods, which might explain its better performance. 
A Proof of Lemma 3 

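Lemma 3. The stability parameter of a performance measure $\Psi(\cdot)$ can be written as $\delta(\epsilon) \le L_\Psi \cdot \epsilon$ iff its sufficient dual region is bounded in a ball of radius $\Theta(L_\Psi)$.

Proof. Let us denote primal variables using the notation $x = (u, v)$ and dual variables using the notation $\theta = (\alpha, \beta)$. The proof follows from the fact that any value of $\theta$ for which $\Psi^*(\theta) = -\infty$ can be safely excluded from the sufficient dual region.

For proving the result in one direction suppose $\Psi$ is stable with $\delta(\epsilon) = L\epsilon$ for some $L > 0$. Now consider some $\theta \in \mathbb{R}^2$ such that $\|\theta\|_2 > L$. Now set $x_C = -C \cdot \theta$. Then we have

\[
\begin{aligned}
\Psi^*(\theta) &= \inf_{x} \left\{ \langle \theta, x \rangle - \Psi(x) \right\} \\
&\le \inf_{C > 0} \left\{ -C\|\theta\|_2^2 - \Psi(x_C) \right\} \\
&\le \inf_{C > 0} \left\{ -C\|\theta\|_2^2 - \Psi(0) + CL\|\theta\|_\infty \right\} \\
&\le \inf_{C > 0} \left\{ -C\|\theta\|_2^2 - \Psi(0) + CL\|\theta\|_2 \right\} \\
&= \inf_{C > 0} \left\{ -C\|\theta\|_2\left(\|\theta\|_2 - L\right) \right\} - \Psi(0) \\
&= -\infty,
\end{aligned}
\]

where the last step holds since $\|\theta\|_2 - L > 0$. Thus, we can conclude that no dual vector with norm greater than $L$ can be a part of the sufficient dual region. This shows that the sufficient dual region is bounded inside a ball of radius $L$. For proving the result in the other direction, suppose the dual sufficient region is indeed bounded in a ball of radius $R$. Consider two points $x_1, x_2$ and let

\[
\theta_1 = \arg\min_{\theta \in \mathcal{A}_\Psi} \left\{ \langle \theta, x_1 \rangle - \Psi^*(\theta) \right\}, \qquad
\theta_2 = \arg\min_{\theta \in \mathcal{A}_\Psi} \left\{ \langle \theta, x_2 \rangle - \Psi^*(\theta) \right\}.
\]

Now define $f(\theta, x) := \langle \theta, x \rangle - \Psi^*(\theta)$ so that, by the above definition, $f(\theta_1, x_1) = \Psi(x_1)$ and $f(\theta_2, x_2) = \Psi(x_2)$. Now we have

\[
\begin{aligned}
\Psi(x_1) = f(\theta_1, x_1) &\le f(\theta_2, x_1) \\
&\le f(\theta_2, x_2) + \left|\langle \theta_2, x_1 - x_2 \rangle\right| \\
&= \Psi(x_2) + \left|\langle \theta_2, x_1 - x_2 \rangle\right| \\
&\le \Psi(x_2) + R\|x_1 - x_2\|_2,
\end{aligned}
\]

where the fourth step follows from the norm bound on $\theta_2$. Similarly we have

\[
\Psi(x_2) \le \Psi(x_1) + R\|x_1 - x_2\|_2.
\]

This establishes the result. □ 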
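As a concrete instance of the lemma, consider $\Psi(u, v) = \min(u, v)$. Its concave conjugate is zero on the segment $\{\alpha + \beta = 1,\ \alpha, \beta \ge 0\}$ and $-\infty$ elsewhere, so the sufficient dual region sits inside a ball of radius 1, and the measure is 1-Lipschitz. The sketch below checks both facts numerically; the function name and the grid discretization of the dual segment are illustrative assumptions.

```python
import math
import random


def min_via_dual(u, v, grid=1000):
    # Dual form: min over (alpha, 1 - alpha) on the segment of alpha*u + (1 - alpha)*v.
    # The conjugate term is zero on this segment, so it does not appear.
    return min((k / grid) * u + (1.0 - k / grid) * v for k in range(grid + 1))


random.seed(0)
for _ in range(1000):
    x1 = (random.random(), random.random())
    x2 = (random.random(), random.random())
    # The dual representation recovers the primal value.
    assert abs(min_via_dual(*x1) - min(x1)) < 1e-12
    # Lipschitz bound with R = 1, the radius of a ball containing the dual segment.
    assert abs(min(x1) - min(x2)) <= math.dist(x1, x2) + 1e-12
```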


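B Proof of Theorem 4 

Theorem 4. Suppose we are given a stream of random samples $(x_1, y_1), \ldots, (x_T, y_T)$ drawn from a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. Let $\Psi(\cdot)$ be a concave, Lipschitz link function. Let Algorithm 1 be executed with a dual feasible set $\mathcal{A} \supseteq \mathcal{A}_\Psi$, $\eta_t = 1/\sqrt{t}$ and $\eta'_t = 1/\sqrt{t}$. Then, the average model $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$ output by the algorithm satisfies, with probability at least $1 - \delta$,

\[
P_\Psi(\bar{w}) \ge \sup_{w^* \in \mathcal{W}} P_\Psi(w^*) - \delta_\Psi\left(\sqrt{\frac{2B_r^2}{T}\log\frac{1}{\delta}}\right) - \sqrt{\frac{2L_\Psi^2 B_r^2}{T}\log\frac{1}{\delta}} - \frac{L_\Psi^2 + 4B_r^2}{2\sqrt{T}} - \frac{L_\Psi^2 L_r^2 + B_{\mathcal{W}}^2}{2\sqrt{T}}.
\]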


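Proof. For this proof we shall assume that $\Psi$ is $L_\Psi$-Lipschitz so that its sufficient dual region can be bounded by an application of Lemma 3. Notice that the updates for $(\alpha, \beta)$ can be written as follows:

\[
(\alpha_{t+1}, \beta_{t+1}) \leftarrow \Pi_{\mathcal{A}}\left((\alpha_t, \beta_t) - \eta'_t \nabla_{(\alpha,\beta)}\ell_t(\alpha_t, \beta_t)\right),
\]

where

\[
\ell_t(\alpha, \beta) =
\begin{cases}
\alpha\, r^+(w_t; x_t, y_t) - \Psi^*(\alpha, \beta) & \text{if } y_t > 0 \\
\beta\, r^-(w_t; x_t, y_t) - \Psi^*(\alpha, \beta) & \text{if } y_t < 0
\end{cases}
\]

which can be interpreted as simple gradient descent with $\ell_t$. Moreover, since $\Psi^*$ is concave, $\ell_t$ is convex with respect to $(\alpha, \beta)$ for every $t$. Note that the terms $r^+(w_t; x_t, y_t)$ and $r^-(w_t; x_t, y_t)$ do not involve $\alpha, \beta$ and hence act as arbitrary bounded positive constants for this part of the analysis.

Note that by Lemma 3, we have the radius of $\mathcal{A}_\Psi$ bounded by $L_\Psi$. Also, since $\Psi$ is a monotone function, by a similar argument, $\Psi^*(\alpha, \beta)$ can be shown to be a $B_\Psi$-Lipschitz function. For all the performance measures considered, we have $B_\Psi \le B_r$. Thus, $\ell_t(\alpha, \beta)$ is a $2B_r$-Lipschitz function. Hence, using a standard GIGA-style analysis Zinkevich [2003] on the (descent) updates on $\alpha_t$ and $\beta_t$ in Algorithm 1 we have (for $\eta'_t = 1/\sqrt{t}$)

\[
\begin{aligned}
&\frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r^+(w_t; x_t, y_t) + \beta_t r^-(w_t; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right] \\
&\quad\le \inf_{(\alpha,\beta) \in \mathcal{A}} \left\{ \frac{1}{T}\sum_{t=1}^{T}\left[\alpha\, r^+(w_t; x_t, y_t) + \beta\, r^-(w_t; x_t, y_t) - \Psi^*(\alpha, \beta)\right] \right\} + \frac{L_\Psi^2 + 4B_r^2}{2\sqrt{T}} \\
&\quad= \Psi\left(\frac{1}{T}\sum_{t=1}^{T} r^+(w_t; x_t, y_t),\ \frac{1}{T}\sum_{t=1}^{T} r^-(w_t; x_t, y_t)\right) + \frac{L_\Psi^2 + 4B_r^2}{2\sqrt{T}},
\end{aligned}
\]

where the last step follows from Fenchel conjugacy. 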

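Further, noting that $\mathbb{E}\left[r^+(w_t; x_t, y_t) \mid x_{1:t-1}, y_{1:t-1}\right] = P(w_t)$, and $\mathbb{E}\left[r^-(w_t; x_t, y_t) \mid x_{1:t-1}, y_{1:t-1}\right] = N(w_t)$, we use the standard online-batch conversion bounds Cesa-Bianchi et al. [2001] to the loss functions $r^+$ and $r^-$ individually to obtain w.h.p.

\[
\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T} r^+(w_t; x_t, y_t) &\le \frac{1}{T}\sum_{t=1}^{T} P(w_t) + \sqrt{\frac{2B_r^2}{T}\log\frac{1}{\delta}}, \\
\frac{1}{T}\sum_{t=1}^{T} r^-(w_t; x_t, y_t) &\le \frac{1}{T}\sum_{t=1}^{T} N(w_t) + \sqrt{\frac{2B_r^2}{T}\log\frac{1}{\delta}}.
\end{aligned}
\]

By monotonicity of $\Psi$, we get

\[
\begin{aligned}
&\frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r^+(w_t; x_t, y_t) + \beta_t r^-(w_t; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right] \\
&\quad\le \Psi\left(\frac{1}{T}\sum_{t=1}^{T} P(w_t) + \sqrt{\frac{2B_r^2}{T}\log\frac{1}{\delta}},\ \frac{1}{T}\sum_{t=1}^{T} N(w_t) + \sqrt{\frac{2B_r^2}{T}\log\frac{1}{\delta}}\right) + \frac{L_\Psi^2 + 4B_r^2}{2\sqrt{T}} \\
&\quad\le \Psi\left(\frac{1}{T}\sum_{t=1}^{T} P(w_t),\ \frac{1}{T}\sum_{t=1}^{T} N(w_t)\right) + \delta_\Psi\left(\sqrt{\frac{2B_r^2}{T}\log\frac{1}{\delta}}\right) + \frac{L_\Psi^2 + 4B_r^2}{2\sqrt{T}} \\
&\quad\le \Psi\left(P\left(\frac{1}{T}\sum_{t=1}^{T} w_t\right),\ N\left(\frac{1}{T}\sum_{t=1}^{T} w_t\right)\right) + \delta_\Psi\left(\sqrt{\frac{2B_r^2}{T}\log\frac{1}{\delta}}\right) + \frac{L_\Psi^2 + 4B_r^2}{2\sqrt{T}} \\
&\quad= \Psi\left(P(\bar{w}), N(\bar{w})\right) + \delta_\Psi\left(\sqrt{\frac{2B_r^2}{T}\log\frac{1}{\delta}}\right) + \frac{L_\Psi^2 + 4B_r^2}{2\sqrt{T}},
\end{aligned}
\tag{1}
\]

where the second inequality follows from stability of $\Psi$, and the third inequality follows from concavity of $r^+$ and $r^-$, Jensen's inequality, and stability of $\Psi$.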

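Similarly, the update to $w$ can be written as

\[
w_{t+1} \leftarrow \Pi_{\mathcal{W}}\left(w_t - \eta_t \nabla_w h_t(w_t)\right),
\]

where $\Pi_{\mathcal{W}}$ is the projection operator for the domain $\mathcal{W}$ and

\[
h_t(w) =
\begin{cases}
-\alpha_t r^+(w; x_t, y_t) + \Psi^*(\alpha_t, \beta_t) & \text{if } y_t > 0 \\
-\beta_t r^-(w; x_t, y_t) + \Psi^*(\alpha_t, \beta_t) & \text{if } y_t < 0
\end{cases}
\]

Since $r^+, r^-$ are concave and the term $\Psi^*(\alpha_t, \beta_t)$ does not involve $w$, $h_t(w)$ is convex in $w$ for all $t$. Also, we can show that $h_t(w)$ is an $(L_\Psi \cdot L_r)$-Lipschitz function. Hence, applying a standard GIGA analysis Zinkevich [2003] to the (ascent) update on $w_t$ in Algorithm 1 (with $\eta_t = 1/\sqrt{t}$), we have for any $w^* \in \mathcal{W}$,

\[
\begin{aligned}
&\frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r^+(w_t; x_t, y_t) + \beta_t r^-(w_t; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right] \\
&\quad\ge \frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r^+(w^*; x_t, y_t) + \beta_t r^-(w^*; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right] - \frac{L_\Psi^2 L_r^2 + B_{\mathcal{W}}^2}{2\sqrt{T}}.
\end{aligned}
\]

Again, observing that by linearity of expectation, we have

\[
\mathbb{E}_{x_t, y_t}\left[\alpha_t r^+(w^*; x_t, y_t) + \beta_t r^-(w^*; x_t, y_t) \mid x_{1:t-1}, y_{1:t-1}\right] = \alpha_t P(w^*) + \beta_t N(w^*),
\]

which gives us, through an online-batch conversion argument Cesa-Bianchi et al. [2001], w.h.p.,

\[
\begin{aligned}
&\frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r^+(w_t; x_t, y_t) + \beta_t r^-(w_t; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right] \\
&\quad\ge \frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t P(w^*) + \beta_t N(w^*)\right] - \frac{1}{T}\sum_{t=1}^{T}\Psi^*(\alpha_t, \beta_t) - \sqrt{\frac{2L_\Psi^2 B_r^2}{T}\log\frac{1}{\delta}} - \frac{L_\Psi^2 L_r^2 + B_{\mathcal{W}}^2}{2\sqrt{T}} \\
&\quad\ge \frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t P(w^*) + \beta_t N(w^*)\right] - \Psi^*(\bar{\alpha}, \bar{\beta}) - \sqrt{\frac{2L_\Psi^2 B_r^2}{T}\log\frac{1}{\delta}} - \frac{L_\Psi^2 L_r^2 + B_{\mathcal{W}}^2}{2\sqrt{T}} \\
&\quad= \bar{\alpha} P(w^*) + \bar{\beta} N(w^*) - \Psi^*(\bar{\alpha}, \bar{\beta}) - \sqrt{\frac{2L_\Psi^2 B_r^2}{T}\log\frac{1}{\delta}} - \frac{L_\Psi^2 L_r^2 + B_{\mathcal{W}}^2}{2\sqrt{T}} \\
&\quad\ge \inf_{(\alpha,\beta)}\left\{\alpha P(w^*) + \beta N(w^*) - \Psi^*(\alpha, \beta)\right\} - \sqrt{\frac{2L_\Psi^2 B_r^2}{T}\log\frac{1}{\delta}} - \frac{L_\Psi^2 L_r^2 + B_{\mathcal{W}}^2}{2\sqrt{T}} \\
&\quad= \Psi\left(P(w^*), N(w^*)\right) - \sqrt{\frac{2L_\Psi^2 B_r^2}{T}\log\frac{1}{\delta}} - \frac{L_\Psi^2 L_r^2 + B_{\mathcal{W}}^2}{2\sqrt{T}},
\end{aligned}
\tag{2}
\]

where the second step follows from concavity of $\Psi^*$ and Jensen's inequality, in the third step $\bar{\alpha} = \frac{1}{T}\sum_{t=1}^{T}\alpha_t$ and $\bar{\beta} = \frac{1}{T}\sum_{t=1}^{T}\beta_t$, and the last step follows from Fenchel conjugacy.

Combining Eq. (1) and (2) gives us the desired result.

□ 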
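The two updates analysed in this proof (a projected descent step on the dual variables and a projected ascent step on the model) can be sketched for the Min-TPR/TNR measure, whose dual region is the segment $\alpha + \beta = 1$, $\alpha, \beta \ge 0$ with $\Psi^* = 0$. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the reward `min(1, y * <w, x>)` (a concave, hinge-style reward), the class-prior scaling by `p`, the ball radius, and all function names are hypothetical choices made for the example.

```python
import math


def project_ball(w, radius):
    # Euclidean projection onto the ball of the given radius (the domain W).
    norm = math.sqrt(sum(c * c for c in w))
    return w if norm <= radius else [c * radius / norm for c in w]


def project_segment(alpha, beta):
    # Euclidean projection onto {alpha + beta = 1, alpha, beta >= 0}.
    a = min(1.0, max(0.0, (alpha - beta + 1.0) / 2.0))
    return a, 1.0 - a


def reward_and_grad(w, x, y):
    # Assumed concave reward min(1, y * <w, x>) and one of its supergradients.
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin < 1.0:
        return margin, [y * xi for xi in x]
    return 1.0, [0.0] * len(x)


def primal_dual_min(stream, dim, p, radius=10.0):
    """One pass over a non-empty stream; returns the averaged model and final duals."""
    w = [0.0] * dim
    alpha, beta = 0.5, 0.5
    w_sum = [0.0] * dim
    steps = 0
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / math.sqrt(t)  # step size 1/sqrt(t) for both updates
        w_sum = [s + wi for s, wi in zip(w_sum, w)]
        steps = t
        r, g = reward_and_grad(w, x, y)
        if y > 0:
            scale, new_alpha, new_beta = alpha / p, alpha - eta * r / p, beta
        else:
            scale, new_alpha, new_beta = beta / (1.0 - p), alpha, beta - eta * r / (1.0 - p)
        # Primal ascent on the dual-weighted reward, then projection onto W.
        w = project_ball([wi + eta * scale * gi for wi, gi in zip(w, g)], radius)
        # Dual descent (Psi* is zero on the segment), then projection onto A.
        alpha, beta = project_segment(new_alpha, new_beta)
    return [s / steps for s in w_sum], (alpha, beta)
```

The dual step shifts weight toward whichever class currently earns the lower reward, which is how the linearization tracks the minimum of the two rates.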
C Proof of Theorem 5 

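Theorem 5. Suppose we have the problem setting in Theorem 4 with the $\Psi_{\text{G-mean}}$ performance measure being optimized for. Consider a modification to Algorithm 1 wherein the reward functions are changed to $r_t^+(\cdot) = r^+(\cdot) + \epsilon(t)$, and $r_t^-(\cdot) = r^-(\cdot) + \epsilon(t)$ for $\epsilon(t) = 1/\sqrt{t}$. Then, the average model $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$ output by the algorithm satisfies, with probability at least $1 - \delta$,

\[
P_\Psi(\bar{w}) \ge \sup_{w^* \in \mathcal{W}} P_\Psi(w^*) - \tilde{O}\left(\frac{1}{T^{1/4}}\right).
\]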

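Proof. Suppose $\Psi(u + \epsilon, v + \epsilon) \le \Psi(u, v) + \delta_\Psi(\epsilon)$ as before. Let $r_t^+(\cdot) = r^+(\cdot) + \epsilon(t)$, and $r_t^-(\cdot) = r^-(\cdot) + \epsilon(t)$. Let us make all updates with respect to $r_t^{\pm}$. Let $r(\epsilon)$ be the radius of the sufficient dual domain $\mathcal{A}$ for a given regularization $\epsilon$. Also let $\bar{\epsilon} = \frac{1}{T}\sum_{t=1}^{T}\epsilon(t)$. We will assume throughout that $\epsilon(t) = O(1)$. Then we have:

\[
\begin{aligned}
&\frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r_t^+(w_t; x_t, y_t) + \beta_t r_t^-(w_t; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right] \\
&\quad\le \inf_{(\alpha,\beta)}\left\{\frac{1}{T}\sum_{t=1}^{T}\left[\alpha\, r_t^+(w_t; x_t, y_t) + \beta\, r_t^-(w_t; x_t, y_t) - \Psi^*(\alpha, \beta)\right]\right\} + O\left(\frac{r(\bar{\epsilon})}{\sqrt{T}}\right) \\
&\quad= \inf_{(\alpha,\beta)}\left\{\alpha\left(\frac{1}{T}\sum_{t=1}^{T} r^+(w_t; x_t, y_t) + \bar{\epsilon}\right) + \beta\left(\frac{1}{T}\sum_{t=1}^{T} r^-(w_t; x_t, y_t) + \bar{\epsilon}\right) - \Psi^*(\alpha, \beta)\right\} + O\left(\frac{r(\bar{\epsilon})}{\sqrt{T}}\right) \\
&\quad= \Psi\left(\frac{1}{T}\sum_{t=1}^{T} r^+(w_t; x_t, y_t) + \bar{\epsilon},\ \frac{1}{T}\sum_{t=1}^{T} r^-(w_t; x_t, y_t) + \bar{\epsilon}\right) + O\left(\frac{r(\bar{\epsilon})}{\sqrt{T}}\right) \\
&\quad\le \Psi\left(\frac{1}{T}\sum_{t=1}^{T} r^+(w_t; x_t, y_t),\ \frac{1}{T}\sum_{t=1}^{T} r^-(w_t; x_t, y_t)\right) + \delta_\Psi(\bar{\epsilon}) + O\left(\frac{r(\bar{\epsilon})}{\sqrt{T}}\right).
\end{aligned}
\tag{3}
\]

We can now use online to batch conversion bounds Cesa-Bianchi et al. [2001], and monotonicity of $\Psi$ to get

\[
\frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r_t^+(w_t; x_t, y_t) + \beta_t r_t^-(w_t; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right]
\le \Psi\left(P(\bar{w}), N(\bar{w})\right) + \delta_\Psi\left(\tilde{O}\left(\frac{1}{\sqrt{T}}\right)\right) + \delta_\Psi(\bar{\epsilon}) + O\left(\frac{r(\bar{\epsilon})}{\sqrt{T}}\right).
\tag{4}
\]

For the primal updates, we get, for any $w^* \in \mathcal{W}$,

\[
\begin{aligned}
&\frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r_t^+(w_t; x_t, y_t) + \beta_t r_t^-(w_t; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right] \\
&\quad\ge \frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r_t^+(w^*; x_t, y_t) + \beta_t r_t^-(w^*; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right] - O\left(\frac{r(\bar{\epsilon})}{\sqrt{T}}\right) \\
&\quad= \frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r^+(w^*; x_t, y_t) + \beta_t r^-(w^*; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right] + \frac{1}{T}\sum_{t=1}^{T}\epsilon(t)(\alpha_t + \beta_t) - O\left(\frac{r(\bar{\epsilon})}{\sqrt{T}}\right) \\
&\quad\ge \frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r^+(w^*; x_t, y_t) + \beta_t r^-(w^*; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right] - O\left(\frac{r(\bar{\epsilon})}{\sqrt{T}}\right)
\end{aligned}
\]

since $\epsilon(t), \alpha_t, \beta_t \ge 0$. Again using an online-batch conversion argument Cesa-Bianchi et al. [2001] we get w.h.p.

\[
\frac{1}{T}\sum_{t=1}^{T}\left[\alpha_t r_t^+(w_t; x_t, y_t) + \beta_t r_t^-(w_t; x_t, y_t) - \Psi^*(\alpha_t, \beta_t)\right] \ge \Psi\left(P(w^*), N(w^*)\right) - O\left(\frac{r(\bar{\epsilon})}{\sqrt{T}}\right).
\tag{5}
\]

Combining Eq. (4) and (5) gives us

\[
\Psi\left(P(\bar{w}), N(\bar{w})\right) \ge \Psi\left(P(w^*), N(w^*)\right) - O\left(\frac{r(\bar{\epsilon})}{\sqrt{T}}\right) - \delta_\Psi\left(\tilde{O}\left(\frac{1}{\sqrt{T}}\right)\right) - \delta_\Psi(\bar{\epsilon}).
\]

For G-mean, $\delta_\Psi(x) = \sqrt{x}$, and by an application of Lemma 3 we have $r(\epsilon) = O(1/\sqrt{\epsilon})$. Thus we have

\[
\Psi\left(P(\bar{w}), N(\bar{w})\right) \ge \Psi\left(P(w^*), N(w^*)\right) - O\left(\frac{1}{\sqrt{\bar{\epsilon}\,T}}\right) - \tilde{O}\left(\frac{1}{T^{1/4}}\right) - \sqrt{\bar{\epsilon}}.
\]

For $\bar{\epsilon} = O\left(1/\sqrt{T}\right)$, we get

\[
\Psi\left(P(\bar{w}), N(\bar{w})\right) \ge \Psi\left(P(w^*), N(w^*)\right) - \tilde{O}\left(\frac{1}{T^{1/4}}\right).
\]

This can be achieved with $\epsilon(t) = 1/\sqrt{t}$.

□ 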


D Proof of Theorem 8 

Theorem 8. Let Algorithm 2 be executed with a performance measure $\mathcal{P}_{(a,b)}$ and reward functions that offer values in the range $[0, m)$. Let $\mathcal{P}^* := \sup_{w \in \mathcal{W}} \mathcal{P}_{(a,b)}(w)$. Also let $\Delta_t = \mathcal{P}^* - \mathcal{P}_{(a,b)}(w_t)$ be the excess error for the model $w_t$ generated at time $t$. Then there exists a value $\eta(m) < 1$ such that $\Delta_t \le \Delta_0 \cdot \eta(m)^t$.

Proof. In order to be generic in its treatment, the proof will require the following regularity conditions on the performance measure:

1. $b_0 \neq 0$

2. $\alpha - \mathcal{P}(w) \cdot \gamma \ge 0$ for all $w \in \mathcal{W}$

3. $\beta - \mathcal{P}(w) \cdot \delta \ge 0$ for all $w \in \mathcal{W}$

4. $-1 < f \le \gamma \cdot P(w) + \delta \cdot N(w) \le g$ for all $w \in \mathcal{W}$

Define $e_t := V(w_{t+1}, v_t) - v_t$. Then we can state the following lemmata which together yield the convergence bound proof.

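Lemma 10. $v_t + \frac{e_t}{1+f} \ge \mathcal{P}^*$.

Proof. Assume that for some $w^*$, $\mathcal{P}(w^*) = v_t + \frac{e_t}{1+f} + e'$ where $e' > 0$. Then we have

\[
\begin{aligned}
V(w^*, v_t) - v_t - e_t &= \left(\frac{e_t}{1+f} + e'\right)\left(1 + \gamma \cdot P(w^*) + \delta \cdot N(w^*)\right) - e_t \\
&\ge \left(\frac{e_t}{1+f} + e'\right)(1 + f) - e_t \\
&= e'(1 + f) > 0,
\end{aligned}
\]

which contradicts the fact that no classifier can achieve a valuation greater than $v_t + e_t$ at level $v_t$, thus proving the desired result. □ 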

Lemma 11. For any $w$ that achieves $V(w, v) = v + e$ such that $e \ge 0$, we have

\[
\mathcal{P}(w) \ge v + \frac{e}{g + 1}.
\]

Proof. Let $v' = v + \frac{e}{g+1}$. We will show that $V(w, v') \ge v'$ which will establish the result by pseudo-linearity. We have

\[
\begin{aligned}
V(w, v') - v' &= c + (\alpha - v'\gamma) \cdot P(w) + (\beta - v'\delta) \cdot N(w) - v' \\
&= c + (\alpha - v\gamma) \cdot P(w) + (\beta - v\delta) \cdot N(w) - v' - \frac{e}{g+1}\left(\gamma \cdot P(w) + \delta \cdot N(w)\right) \\
&= v + e - v' - \frac{e}{g+1}\left(\gamma \cdot P(w) + \delta \cdot N(w)\right) \\
&\ge e\left(1 - \frac{1}{g+1} - \frac{g}{g+1}\right) = 0,
\end{aligned}
\]

where we have used the bounds on $\gamma \cdot P(w) + \delta \cdot N(w)$ and the fact that $1 + g > 0$.

□ 


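Given the above results we can establish the convergence bound. More specifically, we can show the following: let $\Delta_t = \mathcal{P}^* - \mathcal{P}(w_t)$. Then we have

\[
\Delta_{t+1} \le \frac{g - f}{g + 1} \cdot \Delta_t.
\]

To see this, consider the following

\[
\begin{aligned}
\Delta_{t+1} = \mathcal{P}^* - \mathcal{P}(w_{t+1}) &\le \mathcal{P}^* - \left(v_t + \frac{e_t}{g+1}\right) \\
&\le \mathcal{P}^* - \left(v_t + \frac{(1+f)(\mathcal{P}^* - v_t)}{g+1}\right) \\
&= \mathcal{P}^* - \left(\mathcal{P}(w_t) + \frac{(1+f)(\mathcal{P}^* - \mathcal{P}(w_t))}{g+1}\right) \\
&= \Delta_t - \frac{1+f}{g+1} \cdot \Delta_t = \frac{g-f}{g+1} \cdot \Delta_t,
\end{aligned}
\]

where the first step uses Lemma 11, the second uses Lemma 10, and the third uses $v_t = \mathcal{P}(w_t)$. This proves the result. Notice that the table of convergence rates gives the rates for the different performance measures by calculating bounds on the value of $\eta(m)$ for those performance measures. □ 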


E An analysis of the AMP Algorithm under Inexact Maximizations 

For this and the next section, we will, for the sake of simplicity, focus only on the F-measure for $\beta = 1$ and $p = 1/2$ so that $\theta = 1$. For this setting, the F-measure looks like the following: $F(P, N) = \frac{2P}{2 + P - N}$, and the valuation function looks like $V(w, v) = (1 - v/2) \cdot P(w) + v/2 \cdot N(w)$. We shall denote the performance measure as $F(w)$, and its optimal value as $F^*$. We will assume that the reward functions give bounded rewards in the range $[0, m)$. 
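The alternation analysed in this appendix can be illustrated on a finite set of candidate classifiers, each summarized by its pair $(P, N)$. The sketch below uses the F-measure and valuation function exactly as written above, with exact maximization and exact level estimation. The function names, the brute-force `max` over candidates, and the fixed iteration count are illustrative assumptions rather than the paper's algorithm listing.

```python
def f_measure(P, N):
    # F-measure for beta = 1 and p = 1/2, as given in the text.
    return 2.0 * P / (2.0 + P - N)


def valuation(P, N, v):
    # Valuation function V(w, v) = (1 - v/2) * P(w) + (v/2) * N(w).
    return (1.0 - v / 2.0) * P + (v / 2.0) * N


def amp(candidates, start=0, iters=20):
    """Alternate between setting the level v and maximizing the valuation."""
    P, N = candidates[start]
    history = []
    for _ in range(iters):
        v = f_measure(P, N)  # challenge level = F-measure of the current model
        history.append(v)
        # Exact maximization of the valuation at level v over the candidates.
        P, N = max(candidates, key=lambda c: valuation(c[0], c[1], v))
    return (P, N), history
```

For example, with candidates `[(0.2, 0.95), (0.5, 0.8), (0.7, 0.6), (0.9, 0.3), (0.99, 0.05)]` and `start=0`, the levels never decrease from one iteration to the next, and the procedure settles on the candidate with the largest F-measure.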

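So far we assumed that Step 4 in the Algorithm AMP gave us $w_{t+1}$ such that

\[
V(w_{t+1}, v_t) = \max_{w \in \mathcal{W}} V(w, v_t).
\]

Now we will only assume that $w_{t+1}$ satisfies

\[
V(w_{t+1}, v_t) = \max_{w \in \mathcal{W}} V(w, v_t) - \epsilon_t.
\]

We continue to write $e_t := \max_{w \in \mathcal{W}} V(w, v_t) - v_t$, so that $V(w_{t+1}, v_t) = v_t + e_t - \epsilon_t$. We also assume that the level $v_t$ is only approximated in Step 5 of AMP, i.e., we have

\[
v_t = F(w_t) + \delta_t,
\]

where $\delta_t$ is a signed real number. 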

Given these approximations, we can prove the following results 

Lemma 12. The following hold for the setting described above:

1. If $\delta_t \le 0$ then $e_t \ge 0$.

2. If $\delta_t > 0$ then $e_t \ge -\delta_t\left(1 + \frac{m}{2}\right)$.

3. If $F^* < v_t$ (which can happen only if $\delta_t > 0$), then $e_t < 0$.

4. If $e_t < 0$ then $F^* < v_t$.

5. We have

(a) If $e_t \ge 0$, then $e_t \ge \frac{2-m}{2}(F^* - v_t)$.

(b) If $e_t < 0$, then $e_t \ge \frac{2+m}{2}(F^* - v_t)$.

6. If $V(w, v) = v + e$, then

(a) If $e \ge 0$ then $F(w) \ge v + \frac{2e}{2+m}$.

(b) If $e < 0$ then $F(w) \ge v + \frac{2e}{2-m}$.

Proof. We give the proof in parts.

1. If $\delta_t \le 0$ then this means that there exists a $w$ such that $F(w) \ge v_t$. The result then follows from pseudo linearity.

2. $v_t = F(w_t) + \delta_t$ gives us, by pseudo linearity of F-measure,

\[
(1 - v_t/2) \cdot P(w_t) + v_t/2 \cdot N(w_t) = v_t - \delta_t\left(1 + \frac{P(w_t) - N(w_t)}{2}\right).
\]

The bound on $e_t$ now follows from its definition.

3. Suppose $e_t \ge 0$ then by pseudo linearity of F-measure, we have, for some $w$, $V(w, v_t) \ge v_t$ which means $F(w) \ge v_t$ which contradicts the assumption.

4. Suppose there exists $w^*$ with $F(w^*) = v_t + e'$ with $e' \ge 0$ then we have

\[
(1 - v_t/2) \cdot P(w^*) + v_t/2 \cdot N(w^*) = v_t + e'\left(1 + \frac{P(w^*) - N(w^*)}{2}\right) \ge v_t,
\]

which contradicts the fact that $e_t < 0$.

5. Part (a) is simply Lemma 10. For part (b), we will prove that $F^* \le v_t + \frac{2e_t}{2+m}$. Since $\frac{2}{2+m} > 0$, the result will follow. Assume the contrapositive that some $w^*$ achieves $F(w^*) = v_t + \frac{2e_t}{2+m} + e'$ for some $e' > 0$. Using the pseudo linearity of F-measure (and using the shorthand $v' = v_t + \frac{2e_t}{2+m} + e'$), this can be expressed as

\[
(1 - v'/2) \cdot P(w^*) + v'/2 \cdot N(w^*) = v'.
\]

Then we have

\[
\begin{aligned}
(1 - v_t/2) \cdot P(w^*) + v_t/2 \cdot N(w^*) - v_t - e_t &= v' - v_t - e_t + \frac{v' - v_t}{2}\left(P(w^*) - N(w^*)\right) \\
&> \left(\frac{2e_t}{2+m} + e'\right)\left(1 + \frac{m}{2}\right) - e_t \\
&= e'\left(1 + \frac{m}{2}\right) > 0,
\end{aligned}
\]

where we have assumed that $e'$ is chosen small enough so that $\frac{2e_t}{2+m} + e' < 0$ still and used the fact that $P(w^*) - N(w^*) < m$. This contradicts the definition of $e_t$.

6. Part (a) is simply Lemma 11. To prove part (b), we let $v' = v + \frac{2e}{2-m}$; then we have

\[
\begin{aligned}
(1 - v'/2) \cdot P(w) + v'/2 \cdot N(w) - v' &= (1 - v/2) \cdot P(w) + v/2 \cdot N(w) - v' + \frac{e}{2-m}\left(N(w) - P(w)\right) \\
&\ge v + e - v' + \frac{me}{2-m} \\
&= e\left(1 - \frac{2}{2-m} + \frac{m}{2-m}\right) \\
&= 0,
\end{aligned}
\]

where the second step follows since $N(w) - P(w) < m$ and $e < 0$ by using the bounds on the reward functions. This proves the result.

□ 


E.1 Convergence analysis 

We have the following cases with us:

1. Case 1 ($\delta_t \le 0$): In this case we are setting $v_t$ to a value less than the F-measure of the current classifier. This should hurt performance. We know that $v_t = F(w_t) + \delta_t$ which gives us, on applying part 5(a) of the previous lemma using $F^* - v_t = \Delta_t - \delta_t$, the following

\[
e_t \ge \frac{2-m}{2}(\Delta_t - \delta_t).
\]

Note that we are guaranteed that $e_t \ge 0$ in this case. Now since the maximization in Step 4 is also carried out approximately, we have $V(w_{t+1}, v_t) = v_t + e_t - \epsilon_t$. Now we have two sub cases.

(a) Case 1.1 ($\epsilon_t \le e_t$): In this case we can apply part 6(a) of the previous lemma to get the following result

\[
\Delta_{t+1} \le \frac{2m}{2+m}\Delta_t - \frac{2m}{2+m}\delta_t + \frac{2\epsilon_t}{2+m}.
\]

(b) Case 1.2 ($\epsilon_t > e_t$): In this case we are actually making negative progress in the maximization step (since we have $V(w_{t+1}, v_t) < v_t$) and we can only invoke part 6(b) of Lemma 12 to get

\[
\Delta_{t+1} \le \frac{2\epsilon_t}{2-m}.
\]

Note that the above result should not be interpreted as a one shot step to a very good classifier. The above result holds along with the condition that $\epsilon_t > e_t$. Thus the performance of the classifier is lower bounded by $e_t$ which depends on how far the current classifier is from the best. 

2. Case 2 ($\delta_t > 0$): In this case we are setting $v_t$ to a value higher than the F-measure of the current classifier. This can mislead the classifier and results in the following two sub-cases.

(a) Case 2.1 ($F^* \ge v_t$): In this case we are still setting $v_t$ to a legitimate value, i.e. one that is a valid F-measure for some classifier in the hypothesis class. This can only benefit the next optimization stage (in fact if we set $v_t = F^*$, then we would obtain the best classifier in this very iteration!). In this case $e_t \ge 0$ and we can use the analyses of Cases 1.1 and 1.2.

(b) Case 2.2 ($F^* < v_t$): In this case we are setting $v_t$ to an illegal value, one that is an unachievable value of F-measure. Consequently, using part 3 of the previous lemma, $e_t < 0$ and using part 5(b) of the previous lemma we get

\[
e_t \ge \frac{2+m}{2}(\Delta_t - \delta_t),
\]

which, upon applying part 6(b) of the previous lemma (since $e_t - \epsilon_t \le e_t < 0$) will give us

\[
\Delta_{t+1} \le \frac{2m}{2-m}(\delta_t - \Delta_t) + \frac{2\epsilon_t}{2-m} \le \frac{2m}{2-m}\delta_t + \frac{2\epsilon_t}{2-m}.
\]


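We can combine the cases together as follows

\[
\begin{aligned}
\Delta_{t+1} &\le \max\left\{ \mathbb{1}\{\delta_t \le 0\}\left(\frac{2m}{2+m}\Delta_t - \frac{2m}{2+m}\delta_t + \frac{2\epsilon_t}{2+m}\right),\ \mathbb{1}\{\epsilon_t > e_t\}\frac{2\epsilon_t}{2-m},\ \mathbb{1}\{\delta_t > 0\}\left(\frac{2m}{2-m}\delta_t + \frac{2\epsilon_t}{2-m}\right) \right\} \\
&\le \frac{2m}{2+m}\Delta_t + \frac{2m}{2-m}|\delta_t| + \frac{2\epsilon_t}{2-m}.
\end{aligned}
\]

If we let $\eta = \frac{2m}{2+m}$, $\eta' = \frac{2m}{2-m}$ and $\xi_t = |\delta_t| + \epsilon_t/m$, then this gives us

\[
\Delta_{t+1} \le \eta\Delta_t + \eta'\xi_t,
\]

which gives us

\[
\Delta_T \le \eta^T\Delta_0 + \eta' \cdot \sum_{i=0}^{T-1}\eta^{T-1-i}\xi_i.
\]

This concludes our analysis. 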

F Proof of Theorem 9 

Theorem 9. Let Algorithm 3 be executed with a performance measure $\mathcal{P}_{(a,b)}$ and reward functions with range $[0, m)$. Let $\eta = \eta(m)$ be the rate of convergence guaranteed for $\mathcal{P}_{(a,b)}$ by the AMP algorithm. Set the epoch lengths to $s_e, s'_e = \tilde{O}\left(1/\eta^{2e}\right)$. Then after $e = \log_{1/\eta}\left(\frac{1}{\epsilon}\log^2\frac{1}{\epsilon}\right)$ epochs, we can ensure with probability at least $1 - \delta$ that $\mathcal{P}^* - \mathcal{P}_{(a,b)}(w_e) \le \epsilon$. Moreover the number of samples consumed till this point is at most $\tilde{O}\left(1/\epsilon^2\right)$.


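Proof. Using Hoeffding's inequality, standard regret and online-to-batch guarantees Cesa-Bianchi et al. [2001], Zinkevich [2003], we can ensure that, if the stream lengths for the Model optimization stage and Challenge level estimation stage procedures are $s_e$ and $s'_e$ respectively, then for some fixed $c > 0$ that is independent of the stream length, we have, with probability at least $1 - \delta$,

\[
\epsilon_e \le c\sqrt{\frac{1}{s_e}\log\frac{1}{\delta}}, \qquad |\delta_e| \le c\sqrt{\frac{1}{s'_e}\log\frac{1}{\delta}}.
\]

Let $T = \log_{1/\eta}\left(\frac{1}{\epsilon}\log^2\frac{1}{\epsilon}\right)$ and $s_e = \frac{4c^2}{m^2}\left(\frac{1}{\eta}\right)^{2e}\log\frac{T}{\delta}$ and $s'_e = 4c^2\left(\frac{1}{\eta}\right)^{2e}\log\frac{T}{\delta}$. This gives us, for each $e$, with probability at least $1 - \delta/T$,

\[
\xi_e = |\delta_e| + \frac{\epsilon_e}{m} \le \eta^e.
\]

Thus, using a union bound, with probability at least $1 - \delta$, we have, by the discussion in the previous section,

\[
\begin{aligned}
\Delta_T &\le \eta^T\Delta_0 + \eta'\sum_{i=0}^{T-1}\eta^{T-1-i}\xi_i \le \eta^T\Delta_0 + \frac{\eta'}{\eta}\,T\eta^T \\
&\le \epsilon\,\Delta_0\log^{-2}\frac{1}{\epsilon} + \frac{\eta'}{\eta}\log_{1/\eta}\left(\frac{1}{\epsilon}\log^2\frac{1}{\epsilon}\right)\epsilon\log^{-2}\frac{1}{\epsilon} \\
&\le \left(\Delta_0 + \frac{\eta'}{\eta\log\frac{1}{\eta}}\right)\epsilon,
\end{aligned}
\]

where the last step follows from the fact that for any $\epsilon < 1/e^2$, we have

\[
\log\left(\frac{1}{\epsilon}\log^2\frac{1}{\epsilon}\right) \le \log^2\frac{1}{\epsilon}.
\]

Let $d = \Delta_0 + \frac{\eta'}{\eta\log\frac{1}{\eta}}$ so that we can later set $\epsilon' = \epsilon/d$, and $s = 4c^2\left(1 + \frac{1}{m^2}\right)$ so that $s_e + s'_e = s\left(\frac{1}{\eta}\right)^{2e}\log\frac{T}{\delta}$. The total number of samples required can then be calculated as

\[
\sum_{e=1}^{T}\left(s_e + s'_e\right) = s\log\frac{T}{\delta}\sum_{e=1}^{T}\left(\frac{1}{\eta}\right)^{2e} \le \frac{s}{1-\eta^2}\left(\frac{1}{\eta}\right)^{2T}\log\frac{T}{\delta} = \frac{s}{1-\eta^2}\cdot\frac{1}{\epsilon^2}\log^4\frac{1}{\epsilon}\,\log\frac{T}{\delta}.
\]

This gives the number of samples required as

\[
O\left(\frac{1}{\epsilon^2}\log^4\frac{1}{\epsilon}\left(\log\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)\right)
\]

to get an $\epsilon$-accurate solution with confidence $1 - \delta$.

□ 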